Abstract

The objective of this Automatic Population of Geospatial Databases
(APGD) Integrated Feasibility Demonstration (IFD) effort is to develop,
evaluate, and demonstrate the enabling technology that will permit
rapid, robust, automated population of geospatial databases, from a
variety of imagery sources, to serve synthetic environments and imagery
exploitation applications. SRI International and its team members GDE
and Vexcel expect to provide procedures that will radically reduce the
need for human intervention in the extraction of 3-D cartographic
features and their attributes from imagery and supporting auxiliary
data. The SRI team will concentrate on the features that are the most
useful for the above target applications and the most time-consuming
for a user to extract manually, such as roads, rivers, communication
lines, buildings, and land cover. We plan to extend our “context-based
vision” paradigm to provide a flexible architecture for incorporating
and appropriately applying existing and future highly competent, but
specialized, feature extraction algorithms. For evaluation purposes and
in support of the FRE contractors, we will construct one or more
“ground truth” datasets, run independent benchmarks, and provide a
mechanism for testing a new module within the latest version of the
full APGD-IFD platform.

1  Introduction 

Current practice in populating geospatial databases requires human
involvement in almost all phases of the effort. Even though the IU
community has produced a large body of effective scene analysis
techniques, there are few (if any) completely automated and robust
end-to-end processes for extracting any of the 50 to 200 features of
probable interest. However, we believe that revolutionary gains in
productivity will be possible in the near term, by creating an
architecture that automates the “core” operations of the feature
extraction process: selection and initialization of algorithms,
selection of their parameters, and evaluation of the results.

A critical problem is that most of the currently available fully
automated algorithms were developed and tested in restricted domains,
and consequently will probably fail if their implicit assumptions are
not satisfied. Furthermore, many of these assumptions and domain
limitations are unknown to the algorithm developers themselves, let
alone the application-oriented user.

Researchers at SRI International have developed an approach called
context-based vision [Strat and Fischler, 1991], in which a collection
of specialized algorithms is assembled to perform various extraction
tasks on different types of data. Each algorithm is accompanied by a
body of information that makes explicit as much as is known about the
conditions under which the algorithm-parameter-image combinations will
(or will not) work. Given a specific task to be performed, the system
selects the appropriate algorithms to be applied and the parameters to
be used.
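
The selection step described above can be illustrated with a minimal
sketch. It is not the authors' implementation: the registry layout, the
algorithm names (fine_road_tracker, coarse_line_detector), the context
key resolution_m, and all numeric values are assumptions made purely
for illustration. Each entry pairs an algorithm with an explicit
applicability condition and a rule that derives its parameters from the
context.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Context = Dict[str, float]


@dataclass
class AlgorithmEntry:
    """One extraction algorithm plus explicit knowledge of when it applies."""
    name: str                                          # hypothetical name
    task: str                                          # e.g. "road"
    applicable: Callable[[Context], bool]              # operating conditions
    parameters: Callable[[Context], Dict[str, float]]  # context -> parameters


def select_algorithms(task: str, context: Context,
                      registry: List[AlgorithmEntry]
                      ) -> List[Tuple[str, Dict[str, float]]]:
    """Return (name, parameters) for every entry whose conditions hold."""
    return [(e.name, e.parameters(context))
            for e in registry
            if e.task == task and e.applicable(context)]


# Illustrative registry; conditions and numbers are invented for the sketch.
REGISTRY = [
    AlgorithmEntry(
        name="fine_road_tracker", task="road",
        applicable=lambda c: c["resolution_m"] <= 1.0,
        parameters=lambda c: {"road_width_px": 6.0 / c["resolution_m"]}),
    AlgorithmEntry(
        name="coarse_line_detector", task="road",
        applicable=lambda c: c["resolution_m"] > 1.0,
        parameters=lambda c: {"smoothing_sigma": 1.5}),
]

print(select_algorithms("road", {"resolution_m": 0.5}, REGISTRY))
```

Keeping the applicability conditions as data attached to each algorithm,
rather than burying them inside the algorithm, is what lets a control
layer choose among algorithms without understanding their internals.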

Our system for APGD is called the BOS (Battlespace Observer System).
The scenario we envision is one in which a small amount of task
specification and initialization is performed by a human operator. The
system then runs unattended to populate, and continuously update, the
database with the extracted features. Some human inspection and editing
will probably be required. The nominal mode of operation is one of
enhancement/intensification/update/confirmation of the database where
there already exists some prior (perhaps crude) model. Cold-start will
be possible because of the facilities provided by the underlying RCDE
[Heller and Quam, 1997], but this will require an increased amount of
human interaction. Somewhat paradoxically, it is clear that reducing
human interaction requires well-engineered (and perhaps extensive)
facilities for those tasks the human must perform to support the
automated functions. The RCDE/HUB system has been designed to allow
effective human interaction with automated and semiautomated algorithms
to accomplish the task of constructing 3-D scene models from images,
sensor data, and stored knowledge.

A major expected result of this APGD project will be a revolutionary
reduction in the effort and time needed to construct 3-D attributed
scene models, thereby making feasible widespread use of already-proven
simulation, analysis, and visualization techniques, and enabling new
applications. Examples include new intelligence products (e.g.,
hyperlink-annotated VRML models, texture mapped with current imagery,
that can be retrieved via the World-Wide Web (WWW) or Intelink) and an
enabling technology for implementing “intelligent push” for battlefield
dissemination of tactically relevant images that have been identified
by an automatic model-based “Quick-Look” process.

1.1  Approach 

Our  approach  includes  the  following  components. 

•  An overall modular and expandable system architecture - called the
BOS (for the Battlespace Observer System, see Figure 1) - based on a
rigorous photogrammetric/geometric framework, that can accept inputs
from a wide variety of sensor and other information sources, and is
able to support a large number of distinct Battlefield Awareness
applications (in particular, those concerned with constructing
synthetic environments and model-supported exploitation). Using the
current RADIUS/RCDE as a base, the proposed architecture will provide
facilities for manual interaction where required (especially for
initialization and editing) and the cartographic infrastructure to
support automated algorithms without requiring them to provide their
own machinery to interact with a geo-referenced model.

•  A context-based algorithm control system (CBACS) that selects and
parameterizes algorithms to assure their effective use. CBACS provides
a modular way of integrating a large number of existing feature
extraction techniques, that work well in narrow/specific situations,
into a feature extraction and attribution system that is reliable over
the broad range of conditions that will be encountered in real-world
environments.

•  A sensor calibration and control subsystem that supports
cross-sensor analysis of NTM, SAR, IFSAR, IR, panchromatic, and
hyperspectral imagery. This subsystem includes facilities for
photogrammetrically rigorous error analysis and propagation, efficient
procedures for applying image-to-world and world-to-image coordinate
system transformations, and a uniform Application Programming Interface
(API) for all sensor types.

•  A collection of algorithms for feature extraction and attribution.
Feature Extraction Managers will control the extraction process for
each class and type of feature, including:

1.  linear features (roads, drainage, power lines)

2.  compact 3-D structures (buildings, power poles)

3.  areal features (water bodies, land cover classification)

4.  terrain (elevation, prominent visible features, mobility
characteristics)

5.  man-made movable objects (vehicles, time critical mobile targets)

Figure 1: The Battlespace Observer System Architecture

•  A persistent object-oriented blackboard to store our continuously
updatable world model. It can accommodate incomplete and conflicting
information - with pointers back to the relevant imagery when it is
desirable for the application to present the user with a synthesized
view of the scene rather than an interpreted description; this is
especially important when the feature extraction algorithm knows it has
failed to make a correct interpretation.

•  Establishment of an APGD “Virtual Lab,” utilizing the WWW, for
allowing continuous evaluation of our progress and for forming and
strengthening an APGD community. The Virtual Lab will enable continuous
access to raw data, contextual information, ground truth, extracted
models, and evaluation procedures.

The major deliverable at the end of the two-year base period will be a
detailed specification for a modular, extensible architecture,
supporting algorithms, and evaluation techniques for the task of APGD.
At the end of the full five-year contract, we will have produced an
application-ready, end-to-end, prototype system that can be
demonstrated and evaluated within an instrumented environment. This
system will be suitable for quasi-operational use at TEC, the NEL, or
other government labs. In the interim, we expect that many of our
advances will have been transferred into operational systems.

2  Scientific  Issues  and  Research  Questions 

The central problem we address in this effort is that of reducing the
role of (or completely removing) the human operator from the
“low-level” processes of constructing geospatial models from images and
stored data. The primary contributions of the human operator are (1) to
understand the nature of the task being performed so as to be able to
evaluate the intermediate and final products of the automated
functions, and (2) to select the appropriate extraction algorithms to
be employed and to guide the extraction system toward a satisfactory
answer by making appropriate adjustments in its parameters.

Thus, to remove the human, we must be able to produce an “objective
function” (which we will call ObjF) that can numerically or
categorically evaluate the adequacy (for some specified application) of
a proposed interpretation of the visual data in the absence of explicit
knowledge of “ground truth.” If we can accomplish this goal, we can
also accomplish the difficult part of step (2) by iteratively modifying
the extraction-algorithm parameters until we achieve a suitable score
for the suggested interpretation.
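
The iterative use of ObjF described above can be sketched as follows.
This is only an illustration under stated assumptions: the exhaustive
grid search is a stand-in for whatever parameter-adjustment strategy is
eventually adopted, and toy_extract, toy_objf, and the expected feature
width are hypothetical. Note that toy_objf scores an interpretation
from internal plausibility alone (one contiguous run of roughly the
expected width), without reference to ground truth.

```python
from itertools import product


def tune_parameters(extract, objf, image, grid, good_enough):
    """Try parameter settings until objf reports an acceptable score.

    Returns (score, params, result) for the best setting seen.
    """
    names = sorted(grid)
    best = None
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        result = extract(image, **params)
        score = objf(result)
        if best is None or score > best[0]:
            best = (score, params, result)
        if score >= good_enough:   # suitable score reached: stop searching
            break
    return best


def toy_extract(profile, threshold):
    """Hypothetical extractor: indices of samples brighter than threshold."""
    return [i for i, v in enumerate(profile) if v > threshold]


def toy_objf(indices, expected_width=3):
    """Hypothetical ObjF: prefer one gap-free run of the expected width."""
    gaps = sum(1 for a, b in zip(indices, indices[1:]) if b != a + 1)
    return 1.0 / (1.0 + abs(len(indices) - expected_width) + gaps)


profile = [0.1, 0.2, 0.9, 0.8, 0.85, 0.3, 0.1]
best = tune_parameters(toy_extract, toy_objf, profile,
                       {"threshold": [0.05, 0.25, 0.5, 0.95]},
                       good_enough=0.9)
print(best)
```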

It is almost certainly the case that the amount of intelligence we
would have to encapsulate in the ObjF to completely remove the human is
not achievable in the relatively short time frame of this effort - but
if we can adequately characterize the expected performance of the
algorithms currently available, and if we can automatically evaluate
their (the algorithms’) true likelihood of having produced a correct
answer in any given trial, then we can indeed achieve the radical
improvements in productivity we are expecting to come out of this work.
The human would still have a critical role in initializing the system
and in editing the final result/output, but he would be removed from
the most time-consuming tasks that he must now perform, and he would
not have to understand the workings of the machine vision system
itself, but only the application and its requirements.

The architecture we have suggested provides a framework for taking
advantage of the proposed automation of the technique invocation and
evaluation role of the human operator; it does not provide that
technology in itself. It is clear that developing effective techniques
for the prediction and evaluation of the results of the algorithmic
(extraction) processes will be the key factor in the success of this
work. Some of the major scientific problems to be addressed include:


2.1  (Automated)  Evaluation 

As discussed above, self-evaluation by an algorithm, or by the more
comprehensive system controlling the algorithm, is perhaps the most
important ingredient in achieving automation and robustness. We
consider this problem our main scientific challenge in this effort.

In addition to (internal, on-line, real-time) self-evaluation, the more
conventional problem of (external) objective evaluation is also
critical to the success of our work. In order to properly evaluate
performance, one should have a comprehensive model of the process to be
evaluated and the problem domain on which the process operates, as well
as a definitive way to measure the correctness of a proposed answer.
None of these items are currently available in the APGD domain. One of
the most serious problems is that of obtaining adequate ground truth to
do statistically meaningful testing. The obvious way to get such ground
truth is to have the type of system we are proposing to develop (but
don’t yet have).

2.2  Robustness 

The human operator is able to quickly and easily detect and remove
gross errors (blunders) in algorithm performance. Since we can’t hope
to duplicate the human operator’s intelligence, common sense, or
experience, we must develop computationally effective procedures for
combining independent automatically derived judgments on all important
automated decision processes. Even if we are successful in detecting
and removing most blunders, a few will undoubtedly slip through our
filters - as well as a larger number of almost correct but still
erroneous numerical evaluations. Our compiled models and our database
will contain incomplete, uncertain, and conflicting information.
Nevertheless, our composite system will be expected to perform
consistently and correctly.
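
One simple way to combine independent judgments so that an isolated
blunder cannot dominate the result is a median-based consensus. The
sketch below is an assumption-laden illustration, not the procedure the
project will adopt: the median/MAD rule, the rejection factor k, and
the sample building-height estimates are all chosen here for the
example.

```python
from statistics import median


def robust_consensus(estimates, k=3.0, min_scale=1e-6):
    """Combine independent numeric estimates and flag gross outliers.

    An estimate is treated as a blunder if it lies more than k median
    absolute deviations (MAD) from the median of all estimates.
    Returns (consensus_value, blunders).
    """
    center = median(estimates)
    mad = median([abs(x - center) for x in estimates])
    scale = max(mad, min_scale)          # avoid a zero-width acceptance band
    inliers = [x for x in estimates if abs(x - center) <= k * scale]
    blunders = [x for x in estimates if abs(x - center) > k * scale]
    return median(inliers), blunders


# Five hypothetical height estimates (meters) for one building, each
# assumed to come from a different algorithm or image pair.
value, blunders = robust_consensus([12.1, 11.9, 12.0, 12.2, 30.0])
print(value, blunders)
```

The median is used instead of the mean because a single gross error can
shift a mean arbitrarily far, whereas it cannot move the median past
the bulk of the agreeing estimates.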

2.3  Introduction  of  an  Automated 
Learning  Component 

The problem of automatically deriving optimal algorithm parameters
based on scene content and acquisition conditions is not a realistic
near term objective. It will probably be necessary to acquire this
information through direct observation and evaluation of algorithm
performance. We will develop and incorporate (into CBACS) an automated
learning component to perform this function.
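
As a rough illustration of acquiring parameter knowledge through
observation, the sketch below keeps a table of evaluation scores per
context and parameter setting and recalls the setting with the best
average. The class name ParameterMemory, the context labels, and the
scores are hypothetical, and this tabular scheme is only one simple
possibility, not the learning component planned for CBACS.

```python
from collections import defaultdict
from statistics import mean


class ParameterMemory:
    """Remember observed scores and recall the best-performing parameters."""

    def __init__(self):
        # (context, sorted parameter items) -> list of observed scores
        self._scores = defaultdict(list)

    def record(self, context, params, score):
        """Store one observed evaluation score for params in a context."""
        key = (context, tuple(sorted(params.items())))
        self._scores[key].append(score)

    def best_params(self, context):
        """Return the parameters with the highest mean score, or None."""
        averages = {key[1]: mean(scores)
                    for key, scores in self._scores.items()
                    if key[0] == context}
        if not averages:
            return None
        return dict(max(averages, key=averages.get))


memory = ParameterMemory()
memory.record(("rural", "SAR"), {"sigma": 1.0}, 0.4)
memory.record(("rural", "SAR"), {"sigma": 2.0}, 0.8)
memory.record(("rural", "SAR"), {"sigma": 2.0}, 0.7)
print(memory.best_params(("rural", "SAR")))
```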

2.4  Algorithm  Competence 

Many of the existing extraction techniques we expect to use appear to
be competent to perform their intended functions, but are deficient in
some critical way; for example, they require exponentially increasing
resources with increase in image size, or they make occasional blunders
which they cannot detect (no self-evaluation), or they require some
contextual information that might not be routinely available. The BOS
architecture we are developing has machinery to cover some of these
deficiencies, but the actual interface between a defective algorithm
and the architecture may require innovation equal to that of directly
improving the algorithm. The architecture increases both our options
and our chances for finding a fix; it does not eliminate the problem of
producing more competent algorithms.

3  Evaluation 

Our evaluation plan focuses on three high-level metrics associated with
the automatic population of geospatial databases. The first is the
amount of time a user requires to interact with the system to extract
and assign attributes to a specified set of features. The second is the
accuracy of the extracted features and their attributes. The third is
the utility of the image sequences generated from the extracted models
(to support “virtual worlds” applications).

Within the RADIUS program, we partially instrumented the RCDE system to
monitor and record the actions of a user applying various techniques to
extract roads [Heller et al., 1996]. The system recorded the number of
mouse-clicks, the selected actions, and the amount of mouse-travel
required to achieve a desired result. We feel that this is a better
measure than, for example, actual computation times because it truly
reflects the amount of human interaction and does not depend on the
speed of the computer being used.*

*Additional details can be found at URL:
http://www.ai.sri.com/~radius/sri/baa-reports/.
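
To make the interaction measures concrete, the following sketch
computes the number of mouse-clicks and the total mouse-travel from a
recorded event log. The event format, a list of (kind, x, y) tuples
with pixel coordinates, is an assumption made for this example; the
actual RCDE instrumentation format is not described here.

```python
import math


def interaction_cost(events):
    """Summarize a log of (kind, x, y) pointer events.

    Counts events of kind "click" and sums the straight-line distance
    between consecutive pointer positions.
    """
    clicks = sum(1 for kind, _, _ in events if kind == "click")
    travel = 0.0
    previous = None
    for _, x, y in events:
        if previous is not None:
            travel += math.hypot(x - previous[0], y - previous[1])
        previous = (x, y)
    return {"clicks": clicks, "travel_px": travel}


log = [("move", 0, 0), ("click", 3, 4), ("move", 3, 10), ("click", 3, 10)]
print(interaction_cost(log))
```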


For this project, we plan to extend this approach and evaluate
algorithms in terms of the type and amount of parameter tuning they
need to perform a task and the types of errors they make. In this way,
the performance of the algorithm can be parameterized by the level of
human interaction required, both for initialization and clean-up, in
addition to the accuracy of the result. This allows us to derive a
multi-dimensional rating for an algorithm, rather than a simple rank
ordering. To choose a particular algorithm for a given task in a given
context, the requirements (e.g., time, accuracy, other available data)
are used to weight the various ratings for an algorithm to derive an
application-specific ranking.
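
The weighting step can be sketched as a weighted sum over rating
dimensions. The dimension names, the normalization of every rating to
the range 0 to 1 with higher values being better, the algorithm names,
and all numbers below are assumptions for illustration; a weighted sum
is only one of several ways such a ranking could be derived.

```python
def rank_algorithms(ratings, weights):
    """Order algorithm names by the weighted sum of their ratings.

    ratings: {algorithm_name: {dimension: value in [0, 1]}}
    weights: {dimension: importance for the application at hand}
    """
    def score(name):
        return sum(weights.get(dim, 0.0) * value
                   for dim, value in ratings[name].items())
    return sorted(ratings, key=score, reverse=True)


# Hypothetical multi-dimensional ratings for two road extractors.
ratings = {
    "tracker_a": {"accuracy": 0.9, "low_interaction": 0.4, "speed": 0.3},
    "tracker_b": {"accuracy": 0.6, "low_interaction": 0.9, "speed": 0.8},
}

accuracy_first = {"accuracy": 0.8, "low_interaction": 0.1, "speed": 0.1}
time_critical = {"accuracy": 0.2, "low_interaction": 0.4, "speed": 0.4}

print(rank_algorithms(ratings, accuracy_first))
print(rank_algorithms(ratings, time_critical))
```

The same ratings yield different orderings under the two weightings,
which is the point of keeping the rating multi-dimensional rather than
collapsing it into a single fixed rank.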

To measure the accuracy of extracted features and their attributes, we
plan to use our current tools to construct detailed models of three
sites and compare them to models constructed with candidate automated
feature extraction techniques. Figure 2 shows part of a model being
constructed of the motor pool area at Ft. Hood. Portions of these
“ground-truth” models will be available continuously on our web site,
so FRE contractors and other groups can compare their latest results to
ground truth. In addition, periodically, we will run evaluations using
previously unreleased data, or new portions of the ground-truth models,
to record and report progress on tasks where “algorithm tuning” is not
possible.

The third metric is the utility of the dynamic visualizations or image
sequences generated from the database. This is the most difficult
attribute to quantify because it not only depends on the quantity and
quality of the extracted features, but also on the task being performed
(e.g., mission rehearsal) and on the rendering techniques and graphics
engines used. We plan to explore a variety of approaches to this
problem, including ideas such as asking users to answer questions about
the site after viewing a “fly-through.” This approach will provide an
empirical evaluation of the generated sequences for specific tasks. We
will also show sequences to people and ask them to list perceptual
problems. For example, the inter-object visibility may not be handled
correctly because the objects were not delineated properly, or the
buildings may appear to float in the air because a consistency
mechanism failed.

An important component of our evaluation and demonstration plans is the
APGD Virtual Lab. Utilizing the WWW, this will allow continuous
evaluation of our progress and foster the formation and strengthening
of the APGD community. We envision a central repository on the web with
links to raw and processed datasets, interactively constructed “ground
truth” models, and models constructed through the use of the best
available automated techniques currently incorporated in the BOS.
Hyperlinks in the models themselves will allow the user to retrieve
performance statistics and intermediate results that are relevant to
the model construction process.

Figure 2: Part of a “ground truth” model being constructed of Ft. Hood.
4  Summary

The objective of this APGD IFD effort is to develop, evaluate, and
demonstrate the enabling technology that will permit rapid, robust,
automated population of geospatial databases - especially in support of
“virtual worlds” and image-exploitation applications. This technology
does not currently exist; therefore, our primary concern in the first
two years of this effort is research/technology development. This first
phase of the APGD/IFD effort is not an attempt to construct a testbed,
or a workstation, or even a completely integrated system.

The central problem our work will address is that of reducing the role
of (or completely removing) the human operator from the “low-level”
processes of constructing geospatial models from images and stored
data. We will develop the technology to achieve robust (predictable,
reliable) system operation employing automated algorithms.

Our approach is centered on a unique architecture that includes the
following major components: a set of automated “feature extraction
managers” and a “context-based algorithm control system”; a sensor
calibration and control system that will support cross-sensor analysis
of NTM, SAR, IFSAR, IR, panchromatic, and hyperspectral imagery; a
collection of algorithms for feature extraction and attribution
primarily concerned with roads, rivers, lines of communication,
buildings, areal features (land cover classification, material
identification, water bodies, etc.), terrain, and man-made movable
objects; and a persistent object-oriented blackboard to store our
continuously updatable world model.

A primary theme of almost every aspect of our work is that of
evaluation. In addition to the critical role evaluation plays in our
context-based algorithm control system, there are many other reasons
why the development of an effective evaluation methodology is a
desirable goal in itself. For example, it provides the ability to track
progress and the ability to engineer systems which can meet specified
performance/robustness objectives. We believe that a successful effort
to develop and exploit evaluative techniques is equal in importance to
the improvement of the feature-extraction algorithms being evaluated.
Abstract

This paper presents an overview of a planned system for automatic
population of geospatial databases which represent urban exteriors as
textured geometric model data. The salient features of the system are
that 1) it employs “pose-imagery”: high-quality digital still images,
each timestamped and annotated with a reliable estimate of the
acquiring camera’s absolute position and attitude; 2) it exploits both
dense and sparse image features by sifting evidence for interpixel
correlation across every image; 3) it addresses the combinatorial
aspects of large-scale 3D reconstruction from a very large number of
images; and 4) it has no “human in the loop.”

This work involves both engineering and research challenges. The
engineering challenges include rapid acquisition of 6-DOF
pose-instrumented, high-resolution imagery, and its insertion into a
hierarchical three-dimensional data structure. The research challenges
include dense reconstruction techniques using thousands of images, and
the incremental construction of a textured three-dimensional model from
selected, spatially related image subsets.

Also  presented  are  an  evaluation  plan  for  the 
project,  as  well  as  a  description  of  the  synthetic, 
hybrid,  and  actual  datasets  to  be  acquired  and 
processed. 
1  Introduction

We have proposed the construction of a prototype system whose objective
is the fully automatic population of geospatial databases (APGD) for
built-up areas. We are developing algorithms to integrate ground-based,
low-flyby, and satellite imagery; robustly handle illumination changes
and occluding foliage; and gracefully yield multi-resolution models in
the presence of data with varying quality. The generated models will
have a form suitable for visual simulation, collision detection, change
detection, line of sight simulation, and other physically-based
simulation operations. We hope also to support model amplification
(given additional observations) and modification (to incorporate the
effects of, e.g., demolition, renovation, or construction activity).
Finally, when data is being gathered, the system should be able to
provide useful directives as to where further imagery is needed.

The objective of the system is to fundamentally accelerate and improve
the population process for complex geospatial databases, surmounting
the scaling problem and human dependencies posed by existing
semi-automatic systems, and providing a fast new model acquisition
capability for simulation operations. An operational system would also
have significant commercial applications: entertainment, embedding of
commercial databases, tourism, education, etc.

2  Objectives 

The principal goal of the system is to populate textured geospatial
databases in a fully automated fashion, and to dramatically reduce the
cost and time required to populate such databases as compared to
existing semi-automatic (i.e., human-assisted) systems. Using many
thousands of close-range images, rather than a few, and annotating them
with high-quality pose estimates in an absolute Earth coordinate
system, make the “footprint” of this effort quite distinct from those
of existing efforts [Niem, 1994] [Collins et al., 1995] [Streilein,
1995]. We hope to learn much and develop a class of useful 3D
reconstruction algorithms novel from both research and engineering
standpoints.

The fundamental engineering strategy employed by our system is the use
of recently available global-positioning systems (GPS) and PC-hosted,
high-quality accelerometry which yield continuous estimates of camera
position and orientation in an absolute coordinate system. The
resulting “pose-camera” will ease some of the most significant
hindrances to achieving automated 3D reconstruction [Faugeras, 1993].
We have developed several novel algorithms that process pose-imagery
to: establish sparse (edge) and dense (region) correspondences;
identify regions of empty space; and reconstruct globally consistent 3D
models (with confidence bounds) from local, occluded observations.
Absolute pose estimates are used to avoid combinatorial blowup while
processing very large numbers (thousands) of images.
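
One way absolute pose can limit the combinatorics is by restricting
matching to image pairs whose cameras were near each other and looking
in similar directions. The sketch below illustrates that idea only; it
is not the authors' algorithm. The uniform grid hash, the distance and
angle thresholds, and the sample poses are assumptions, and each pose
is taken to be a (position, unit view direction) pair of 3-tuples.

```python
import math
from collections import defaultdict
from itertools import product


def candidate_pairs(poses, max_dist, max_angle_deg):
    """Return index pairs of cameras that are close and similarly aimed.

    A uniform grid with cell size max_dist means each camera is compared
    only against cameras in its own and adjacent cells, not all others.
    """
    cos_limit = math.cos(math.radians(max_angle_deg))
    cells = defaultdict(list)
    for index, (position, _) in enumerate(poses):
        cell = tuple(math.floor(c / max_dist) for c in position)
        cells[cell].append(index)

    pairs = set()
    for cell, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in cells.get(neighbor, ()):
                    if i >= j:
                        continue          # record each pair once, as (i, j)
                    (p_i, d_i), (p_j, d_j) = poses[i], poses[j]
                    close = math.dist(p_i, p_j) <= max_dist
                    aligned = sum(a * b for a, b in zip(d_i, d_j)) >= cos_limit
                    if close and aligned:
                        pairs.add((i, j))
    return sorted(pairs)


poses = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    ((2.0, 0.0, 0.0), (1.0, 0.0, 0.0)),     # near camera 0, same direction
    ((2.5, 0.0, 0.0), (-1.0, 0.0, 0.0)),    # near, but facing the other way
    ((100.0, 0.0, 0.0), (1.0, 0.0, 0.0)),   # far away
]
print(candidate_pairs(poses, max_dist=5.0, max_angle_deg=45.0))
```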

Another principal system objective is that the quality and articulation
of the reconstructed models degrade gracefully with the precision and
resolution of the acquired pose and image data. Thus, pose imagery
arising from “fast” drive-bys or fly-bys will yield crude but
consistent 3D models. Higher resolution, or more densely sampled, image
data will enable reconstructed geometry and texture of correspondingly
higher fidelity (model “amplification” or “intensification”). We plan
to validate these algorithms by contrasting their output with the
product of traditional (human-effected) site modeling techniques.

3  Project  Plan 

We are developing both an innovative device (a pose-camera) and
innovative algorithms (for 3D data organization and reconstruction) for
automated population of a geospatial database. The geospatial database
represents the geometric and reflectance characteristics of built
structures observed in the physical world, as well as significant trees
and vegetation. All data are modeled by templates, that is, canonical
parametrized objects fit to multiple observations. The generated 3D
model will support line-of-sight computations, physically-based
collision detection, and other simulation modes. Associated reflectance
information increases realism during simulation by enabling
reillumination by synthetic light sources.

We will pursue four specific subtasks in order to demonstrate system
feasibility. First, we will design and build a pose-instrumented
digital camera. Second, we will deploy the instrument in a known test
environment to determine its performance. Third, we will deploy the
pose-camera in a complex, unmodeled environment, and develop software
algorithms for deriving 3D textured geometry and foliage
representations from the acquired pose imagery. Fourth, we will assess
the system utility and cost by manually modeling some portion of the
automatically-acquired dataset, and comparing traditional techniques
with our methods.

We distinguish between three types of pose-imagery data used for
algorithm development and testing. Simulated data is synthetic imagery
with arbitrary specified pose, generated with standard geometric
modeling and computer graphics (rendering) algorithms. Hybrid data is
imagery acquired by a manually operated camera mounted on a tripod,
with pose estimates derived a posteriori via semi-automatic
photogrammetry [Horn, 1986] or other human assistance. Actual pose
imagery is that produced by the operational pose camera, with initial
pose estimates given by instrumentation and refined by automated
numerical optimization algorithms.

4  Challenges 

Engineering  challenges  in  this  project  include 
rapid  acquisition  of  6-DOF  pose-instrumented, 
high-resolution  imagery,  and  its  insertion  into 
a  hierarchical  three-dimensional  data  structure. 
Instrumentation packages for determining absolute
position and orientation exist, but (as commercially
available) in a largely unintegrated
state.  We  are  assembling  integrated,  PC-hosted 
instrumentation  which  will  maintain  pose  es¬ 
timates  through  a  combination  of  global  posi¬ 
tioning,  accelerometry,  and  odometry.  This  in¬ 
strumentation  will  be  mounted  on  a  wheeled, 
human-propelled  acquisition  platform  (Fig.  1), 
along  with  a  high-resolution  digital  camera, 
on-board  PC  and  digital  tape  storage,  and  a 
battery-based  power  source. 


Figure  1:  A  model  of  the  pose  camera. 


Raw  high-resolution  image  data  demands  an 
enormous  amount  of  storage  space.  An 
external-memory  spatial  data  structure  medi¬ 
ates  storage  and  processing  of  both  the  an¬ 
notated  imagery  and  the  reconstructed  model 
data.  Moreover,  specular  effects  and  changes 
in  lighting  conditions  during  acquisition  will 
complicate  data  collection  and  matching  efforts. 
However,  these  same  factors  also  make  possible 
estimation  of  the  directional  reflectance  proper¬ 
ties  of  each  reconstructed  surface. 

Due  to  the  size  of  the  input  and  output  datasets, 
only  the  most  skeletal  global  operations,  such  as 
insertion of representations of acquiring cameras
into a hierarchical spatial data structure,
will  be  possible.  Global,  pairwise  matching 
strategies  are  unworkable,  since  they  would  re¬ 
sult  in  combinatorial  explosion.  Instead,  lo¬ 
cal  operations  will  correlate  imagery  acquired 
by  proximal  cameras,  or  suspected  to  contain 
observations  of  related  physical  structures.  In¬ 
cremental  construction  and  insertion  of  recon¬ 
structed  geometry  from  overlapping,  adjoining 
imagery  subsets  must  be  supported.  Associ¬ 
ating  spatially  proximal  cameras  avoids  both 
falsely  negative  and  falsely  positive  matches 
that  arise  in  schemes  relying  only  upon  tem¬ 
poral  coherence  (Fig.  2). 


Figure  2:  Temporal  strategies  can  wrongly  as¬ 
sociate  images  (a,b)  or  fail  to  asso¬ 
ciate  relevant  images  (c,d). 
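
The difference between spatial and temporal association can be illustrated with a small sketch. It is not the project's implementation: it assumes camera positions are 2D ground-plane coordinates, and it substitutes a hypothetical uniform grid (the names `proximal_pairs` and `cell_size` are ours) for the hierarchical spatial data structure described above. Only cameras in neighboring grid cells are compared, which avoids a global pairwise search.

```python
from collections import defaultdict
from itertools import product
import math


def proximal_pairs(positions, radius):
    """Return index pairs (i, j), i < j, of cameras closer than `radius`.

    `positions` is a list of (x, y) camera positions in acquisition order.
    A uniform grid with cell size `radius` stands in for the hierarchical
    structure; only the 3x3 block of cells around each camera is searched.
    """
    cell_size = radius
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        grid[(math.floor(x / cell_size), math.floor(y / cell_size))].append(idx)

    pairs = set()
    for (cx, cy), members in grid.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in grid.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j and math.dist(positions[i], positions[j]) < radius:
                        pairs.add((i, j))
    return sorted(pairs)


# Camera 5 returns to the starting point long after camera 0 was acquired.
path = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0),
        (20.0, 10.0), (10.0, 10.0), (0.5, 1.0)]
print(proximal_pairs(path, 2.0))  # [(0, 5)]
```

In this example a purely temporal scheme would pair each image with its successor and never relate images 0 and 5, although they were taken about one unit apart.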


We  augment  traditional  “sparse”  (point- 
and  edge-based)  correspondence  algorithms 
[Faugeras,  1993]  with  “dense”  processing  algo¬ 
rithms  which  correlate  every  pixel  in  every  im¬ 
age  with  some  other  pixel  set.  The  goal,  of 
course,  is  to  find  that  representation  of  the  3D 
world  most  consistent  with  the  totality  of  2D 
observations  (pose  images).  Conservative  er¬ 
ror  bounds  will  be  maintained  for  all  recon¬ 
structed  features  and  structures.  Confidence  in 
this  representation  should  increase  with  more 
numerous, more finely sampled, or
higher-resolution acquired imagery.

Foliage  imagery  can  be  a  significant  hindrance 
to reconstruction of urban exteriors. We address
this issue with a model-based foliage reconstruction
scheme. Synthetic images of procedurally
generated foliage are subjected to filters,
and  their  responses  associated  with  the  foliage 
generator  settings.  Imagery  of  actually  occur¬ 
ring  foliage  is  then  subjected  to  analogous  fil¬ 
tering  in  order  to  identify  similar  procedural  fo¬ 
liage.  The  objective  is  not  to  reconstruct  foliage 
directly,  but  rather  produce  procedural  foliage 
representations  morphologically  similar  to  those 
observed. 
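
The matching step described above can be sketched as a lookup in a library of precomputed filter responses. The text does not say how responses are compared, so the Euclidean nearest-neighbor rule below is an assumption, as are the function name `nearest_generator_settings`, the setting names, and all numeric values.

```python
import math


def nearest_generator_settings(observed_response, library):
    """Pick the generator settings whose synthetic filter response is
    closest (Euclidean distance) to the response of a real foliage image.

    `library` is a list of (settings, response) pairs built offline by
    rendering procedural foliage and filtering the synthetic images.
    """
    best_settings, _ = min(
        library,
        key=lambda entry: math.dist(entry[1], observed_response),
    )
    return best_settings


# Hypothetical library: setting names and response values are illustrative.
library = [
    ({"branching_depth": 3, "leaf_density": 0.2}, (0.10, 0.80, 0.30)),
    ({"branching_depth": 5, "leaf_density": 0.6}, (0.45, 0.40, 0.70)),
    ({"branching_depth": 7, "leaf_density": 0.9}, (0.90, 0.15, 0.85)),
]
match = nearest_generator_settings((0.50, 0.35, 0.65), library)
print(match)  # {'branching_depth': 5, 'leaf_density': 0.6}
```

The result is a set of generator settings rather than recovered geometry, in keeping with the stated objective of producing morphologically similar procedural foliage.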

5  Evaluation  Plan 

We  plan  to  evaluate  the  pose  camera’s  accuracy 
by  deploying  it  in  a  known  (manually-surveyed) 
arena  and  testing  its  reported  pose  against  inde¬ 
pendently  derived  estimates.  We  will  establish 
mean  and  maximum  errors  in  camera  position 
and  camera  attitude  as  a  function  of  time  and 
distance  from  start,  both  for  slow,  regular  mo¬ 
tions,  and  for  rapid  translation  and  rotational 
acceleration. 
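
A minimal sketch of the error statistics named above follows. It assumes reported and surveyed poses are already paired sample by sample, and it reduces attitude to a single heading angle for brevity; the full instrument reports three rotational degrees of freedom. Function names are illustrative.

```python
import math


def position_error(reported, reference):
    """Euclidean distance between reported and surveyed camera positions."""
    return math.dist(reported, reference)


def heading_error_deg(reported_deg, reference_deg):
    """Smallest absolute angular difference, in degrees, in [0, 180]."""
    diff = (reported_deg - reference_deg) % 360.0
    return min(diff, 360.0 - diff)


def summarize(errors):
    """Return (mean, maximum) of a non-empty list of errors."""
    return sum(errors) / len(errors), max(errors)


# Illustrative samples: (reported position, surveyed position).
samples = [((3.0, 4.0, 0.0), (0.0, 0.0, 0.0)),
           ((1.0, 1.0, 1.0), (1.0, 1.0, 0.0))]
mean_err, max_err = summarize([position_error(r, s) for r, s in samples])
print(mean_err, max_err)  # 3.0 5.0
```

Grouping the samples by elapsed time or distance travelled before calling `summarize` would give the errors as a function of time and distance from start.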

We  are  completing  a  careful  “ground-truth”  sur¬ 
vey  of  the  Technology  Square  area.  We  will 
evaluate  the  quality  of  initial  reconstructions  by 
comparison  to  ground  truth  in  the  form  of  man¬ 
ual  survey  data,  and  architectural  facilities  data 
maintained  by  the  campus  Office  of  Facilities 
and  Management.  We  will  determine  the  ac¬ 
curacy  and  resolution  of  3D  feature  reconstruc¬ 
tion,  and  assess  the  faithfulness  of  the  recon¬ 
structed  datasets.  We  plan  also  to  develop  mod¬ 
els  of  degradation  of  reconstructed  data  with 
loss  of  image  resolution  or  pose  accuracy. 

During year two, we will attempt to achieve
basic,  block-based  feature  and  building  recon¬ 
struction  of  Technology  Square  and  some  por¬ 
tion  of  the  MIT  campus  (from  five  to  two  hun¬ 
dred  structures).  The  pose-camera  will  be  man¬ 
ually  moved  through  the  relevant  areas.  Our 
initial  system  implementation  will  generate  re¬ 
constructed  geometry  and  texture  data.  Tex¬ 
tures  (reflectance  maps)  for  each  building  facade 
will  be  generated  by  aggregating  disparate  pose 
images,  to  be  compared  to  “ground  truth”  by 
reference  to  axis-aligned  photographs  of  build¬ 
ing  facades  acquired  manually  under  measured, 
nearly  diffuse  lighting  conditions. 

Finally, we plan to evaluate the throughput and
cost of the system according to several metrics.
The rate of pose-imagery acquisition will
be measured. The computational costs of 3D
reconstruction will be assessed, along with dollar
estimates of system overhead and per-feature
cost. System operation will also be compared,
for a single large dataset (imagery of several
hundred structures), with that of an expert
human operating a traditional semi-automatic
photogrammetry system.

6  Conclusion 

We  have  described  an  ambitious,  but  feasible, 
vision  of  a  system  for  fully  automatic  popu¬ 
lation  of  textured  geospatial  databases  repre¬ 
senting  built-up  areas,  with  no  human  in  the 
loop  except  to  direct  motion  of  the  pose-camera. 
The  system  design  exploits  recent  advances  in 
pose  instrumentation.  Geospatial  data  entities 
and  associated  textures  are  incrementally  de¬ 
duced  from  a  large  collection  of  images  with 
pose  estimates  of  varying  accuracy.  The  sys¬ 
tem  output  is  a  collection  of  geometric  entities, 
each  with  a  conservative  confidence  bound,  or¬ 
ganized  in  a  hierarchical  spatial  database  suit¬ 
able  for  external-memory  algorithms  such  as 
those  employed  by  real-time  simulation  systems. 
Because  the  acquired  dataset  is  fully  three- 
dimensional,  it  can  be  subjected  to  collision 
detection,  line  of  sight,  arbitrary  lighting  and 
atmospheric  conditions,  and  other  physically- 
based  or  phenomenologically-based  simulation 
operations.

Abstract

Constructing geospatial databases is a tedious manual
operation. Automatic 3-D feature extraction
from  2-D  images  requires  solving  a  number  of 
problems.  We  present  a  plan  to  attack  this  task  using 
a  combination  of  tools:  reconstruction  and  reason¬ 
ing  in  3-D,  use  of  multiple  sources  of  data,  percep¬ 
tual  grouping,  use  of  context  and  domain  knowl¬ 
edge,  use  of  previous  maps  and  models,  and  limited 
use  of  human  input  where  applicable. 

1  Objectives 

Geospatial  databases  are  important  for  a  number  of 
battlefield  awareness  tasks  such  as  mission  plan¬ 
ning,  mission  rehearsal,  tactical  training  and  dam¬ 
age  assessment.  Other  applications  include  intelli¬ 
gence  analysis  for  site  monitoring  and  change  de¬ 
tection.  Geospatial  database  requirements  may  vary 
for  the  different  tasks,  but  generally  knowledge  of 
terrain  elevation,  surface  features  and  cultural  fea¬ 
tures  is  needed.  Our  task  focuses  on  the  automatic 
detection  and  description  of  cultural  features,  par¬ 
ticularly  buildings. 

The  importance  of  cultural  features  for  multiple 
tasks  is  quite  clear.  A  mission  plan  in  urban  or  semi- 
urban  environments  must  consider  buildings  and 
similar  structures.  These  may  be  the  targets  of  an 
operation  or  assets  to  be  utilized  in  the  mission. 
Road  networks  and  buildings  are  also  key  compo¬ 
nents  in  a  simulated  scene  for  such  environments. 
Damage  assessment  reports  also  may  be  required 
for the infrastructure. For site monitoring, cultural
features are not only of interest in themselves but
also  provide  context  for  the  detection  of  other  fea¬ 
tures  such  as  vehicles.  The  cultural  features  in  a  da¬ 
tabase  also  help  in  orienting  an  analyst  to  the  obser¬ 
vation  of  a  new  image  by  registering  the  image  to  a 
site  model  and  by  pointing  out  the  major  known  fea¬ 
tures  of  interest.  Thus,  the  construction  of  geospa¬ 
tial  databases  containing  cultural  features  is  of  key 
importance in all aspects of the battlefield environment,
from  initial  site  monitoring  to  mission  planning  and 
rehearsal,  in  the  execution  of  the  mission  itself,  and 
then  in  an  analysis  of  the  results  after  an  action. 

Some  commercial  softcopy  systems  for  construct¬ 
ing  geospatial  databases  are  available.  These  can  be 
helpful  in  the  process  of  recording  the  data  and  car¬ 
rying  out  many  photogrammetric  computations. 
However,  the  extraction  of  the  important  features, 
particularly  the  cultural  features,  largely  remains  a 
manual  task.  At  best,  these  systems  allow  a  user  to 
choose  parameters  of  a  prototype  shape;  the  large 
number  of  parameters  can  make  this  a  tedious  and 
inefficient  process.  There  has  been  some  progress 
on  automated  detection  in  research  systems  but  the 
capabilities  remain  limited.  We  need  to  significant¬ 
ly  expand  the  classes  of  objects  that  can  be  mod¬ 
eled,  the  conditions  under  which  they  may  be  im¬ 
aged  and  to  make  the  systems  more  robust  and  reli¬ 
able. 

2 Research Issues and Technical Approach

The  problem  of  3-D  feature  extraction  is  difficult  in 
many ways. Low-level segmentation techniques
(such as line detection) give incomplete and imperfect
results. Object boundaries may be incomplete,
and many extraneous boundaries, due to markings,
shadows, noise and other features, may be present,
as can be seen in the example of Figure 1. The system
must complete the desired object boundaries
and discard the extraneous ones. It must also infer
the 3-D structure of the objects, which is not explicit
in an intensity image.


Figure 1 Image (top) and extracted line
segments (bottom)


The segmentation and 3-D description problems become
easier if a 3-D range sensor, such as LADAR
or IFSAR, is available, as image discontinuities
correspond more directly to object or surface boundaries.
However, some extraneous boundaries due to
noise and other sources would remain. Also, sensors
such as IFSAR cannot provide a complete
range map; typically the data has areas of missing
elements and may contain some points with grossly
erroneous values.

In the absence of an ideal range sensor, we can infer
depth from multiple images. However, this requires
finding corresponding points or features in two or
more images, which is a complex task in itself. Also,
such a process gives sparse information about 3-D
features, which must still be grouped appropriately
to extract the desired objects.

We plan to overcome the difficulties of 3-D object
detection and description by using a combination of
tools: reconstruction and reasoning in 3-D, use of
multiple sources of data, perceptual grouping, use
of context and domain knowledge, use of previous
maps and models, and limited use of human input
where applicable. This combination is shown schematically
in Figure 2. Reasoning in 3-D has many
advantages, as the objects we desire are 3-D objects
rather than 2-D surface features. The use of multiple
images and multiple sources of data will aid in
the problems of 3-D reconstruction and also in resolving
ambiguities that may be present in a single
image or in images taken from a single sensor. Perceptual
grouping is an essential step in selecting and
organizing lower level features into meaningful objects.
The use of context and domain knowledge
will help reduce ambiguities, help in attribution,
and in some cases help in choosing
the appropriate algorithm parameters. In
cases where previous maps or models exist, these
can be used to reduce the needed work; instead of
building complete models, the system can focus on
change detection and updating. Even with use of
these multiple tools, some human interaction may
be required, either to initiate the automatic processes
correctly or to edit the results. We do this when
necessary but minimize the effort required from the
human operator.

We now describe the elements of our approach in
more detail:

a)  Reconstructing  and  Reasoning  in  3-D 

Explicit  3-D  representations  are  needed  for  many 
applications.  Knowledge  of  3-D  also  makes  the  task 
of segmentation easier as we can find depth and orientation
discontinuities and separate them from image
discontinuities caused by sources such as markings,
shadows and highlights.

Some sensors, such as IFSAR, provide direct measurement
of height; however, this information may
be  incomplete  and  may  contain  significant  isolated 
errors.  Height  can  also  be  extracted  from  multiple 
2-D  images  by  making  correspondences  between 
features  visible  in  two  or  more  images.  The  process 
of  finding  correspondences  is,  however,  a  difficult 
task  in  itself.  One  key  issue  is  the  level  of  features 
at  which  this  matching  should  be  performed.  Low 
level  features  are  easier  to  compute  but  more  am¬ 
biguous.  Higher  level  features  are  easier  to  match 
but  difficult  to  infer  reliably  from  image  data.  We 
use  a  hierarchical  approach  where  features  are 
matched  at  various  levels  and  the  matching  at  one 
level  helps  compute  features  at  higher  levels.  We 
have  constructed  an  early  version  of  a  hierarchical 
multi-image system [Noronha & Nevatia, 1997].
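
The sensitivity of height recovery to correspondence errors can be seen in a simplified worked relation. This is a sketch under assumptions not stated in the text: an idealized, rectified image pair with baseline $B$, focal length $f$ (in pixels), and disparity $d$ between corresponding features. The symbols are introduced here for illustration only.

```latex
% Depth of a matched feature in an idealized rectified pair
Z = \frac{f\,B}{d}
% First-order effect of a disparity (matching) error \delta d
\delta Z \approx \left|\frac{\partial Z}{\partial d}\right| \delta d
         = \frac{f\,B}{d^{2}}\,\delta d
         = \frac{Z^{2}}{f\,B}\,\delta d
```

Under these assumptions a fixed matching error produces a depth error that grows with the square of the range, which is one reason ambiguous low-level matches are costly.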

b)  Use  of  Multiple  Sources 

Multiple  sources  of  data  can  help  in  many  ways. 
One  is  in  estimating  heights  even  if  each  sensor 
only gives a 2-D image. We can also take advantage
of other characteristics of the sensors as the different
sensors may be better at extracting different
kinds  of  information.  For  example,  IFSAR  images 
have  height  information.  In  an  ideal  case,  certain 
kinds  of  features,  such  as  a  flat  horizontal  roof,  may 
be  found  just  by  thresholding  on  height.  Surface 
boundaries  may  be  found  by  first  derivative  discon¬ 
tinuities  in  the  image.  However,  IFSAR  data  is  like¬ 
ly  to  contain  holes  and  some  of  the  values  may  have 
gross  errors.  Thus,  results  obtained  by  simple  pro¬ 
cessing  are  not  likely  to  be  of  sufficient  quality  for 
accurate  building  models.  We  believe  that  IFSAR 
images  will  be  highly  useful  for  detection,  but  de¬ 
tailed  description  and  delineation  may  require  use 
of  supplementary  sensor  data. 

This  is  illustrated  in  Figure  3.  The  results  using  IF¬ 
SAR  alone  are  obtained  by  thresholding  height  to 
get  roof  locations  and  the  EO  results  are  derived  by 
our  monocular  building  detection  system  with  the 
approximate  location  given  by  the  IFSAR  roof  ar¬ 
eas;  these  results  are  shown  only  to  indicate  the  po¬ 
tential  of  combining  sensor  modalities  and  not  to 
imply  these  problems  are  solved. 
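
The height-thresholding step, and the way holes and gross errors limit it, can be sketched as follows. This is an illustration only: the height map is a small list of rows with `None` marking a missing measurement, and the ground height and threshold are assumed to be known constants, which the text does not specify.

```python
def roof_mask(height_map, ground_height, min_roof_height):
    """Mark cells whose height above ground is at least `min_roof_height`.

    `height_map` is a list of rows; `None` marks a hole (no measurement).
    Holes are never marked as roof, which is one reason thresholding alone
    gives incomplete building regions.
    """
    return [
        [h is not None and (h - ground_height) >= min_roof_height for h in row]
        for row in height_map
    ]


# Illustrative data: a roof about 8 m above ground in the right-hand
# columns, one hole inside the roof, and one gross error (250.0) on
# the ground.
heights = [
    [100.0, 100.2, 108.1, 108.0],
    [100.1, 250.0, None, 108.2],
    [100.0, 100.1, 107.9, 108.1],
]
mask = roof_mask(heights, ground_height=100.0, min_roof_height=5.0)
for row in mask:
    print(row)
```

In the middle row the gross error is wrongly marked as roof and the hole is wrongly left out, so the thresholded region is useful as a cue for location but not as a delineation.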

Figure 2 Schematic Diagram of the 3-D Grouping Approach

(c) Buildings using IFSAR alone

(d) Buildings using EO cued by IFSAR

Figure 3 Building extraction using EO and IFSAR

Multispectral and hyperspectral images, if available,
can aid in the processing in many important ways.
They  can  be  useful  in  detecting  vegetation  and  sur¬ 
face  material  and  may  help  in  distinguishing  be¬ 
tween  natural  and  cultural  features.  They  can  be 
useful for distinguishing shadow regions (shadow
regions have different spectral properties; in a normal
color image they typically have a higher blue
component from scattered light). Hyperspectral images
should also aid in segmentation of images as
different  kinds  of  materials  can  be  clustered  by 
spectral  properties  more  easily  than  by  intensity 
alone.  IR  images  may  also  be  useful  for  detecting 
vegetation  and  some  components  of  a  3-D  structure. 


c)  Perceptual  Grouping 

Much  of  the  information  about  an  object  resides  in 
its  geometric  properties.  However,  the  geometry  of 
the  object  needs  to  be  inferred  from  the  image.  Ear¬ 
ly  segmentation  processes  typically  give  low-level 
features  that  are  fragmented  and  the  desired  object 
features  are  not  separated  from  those  in  the  back¬ 
ground  or  those  due  to  surface  markings,  shadows 
or  noise.  The  goal  of  perceptual  grouping  is  to  orga¬ 
nize  these  features  into  meaningful  groups  that  give 
the  desired  object  geometry.  Use  of  certain  kinds  of 
sensors  may  provide  better  connected  features  with 
fewer  distracting  ones,  but  the  process  of  perceptual 
organization  is  still  required. 

Our  approach  to  perceptual  grouping  is  a  hierarchi¬ 
cal  one.  Lower  level  features,  such  as  lines,  are 
grouped  into  successively  higher  levels,  such  as 
parallel  or  symmetric  lines,  which  in  turn  are 
grouped  into  features  that  may  correspond  to  sur¬ 
faces  of  desired  objects.  The  surfaces  may  then  be 
further  grouped  to  give  volumetric  objects.  Multiple 
hypotheses  are  possible  at  each  level.  Our  approach 
is  to  select  among  them  only  at  levels  where  suffi¬ 
cient  information  is  available  to  do  so.  This  results 
in  many  hypotheses  being  generated  at  each  level. 
Also,  the  grouping  process  may  use  either  2-D  or 
3-D  features,  but  the  final  verification  step  for  the 
objects  should  use  3-D  reasoning  wherever  possi¬ 
ble. 
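
One level of the grouping hierarchy described above, from line segments to parallel pairs, can be sketched as follows. This is a simplified illustration and not the authors' system: the angular tolerance is an assumed parameter, and every pair within tolerance is kept as a hypothesis, with selection deferred to later levels as the text describes.

```python
import math
from itertools import combinations


def orientation(segment):
    """Undirected orientation of a segment ((x1, y1), (x2, y2)) in [0, pi)."""
    (x1, y1), (x2, y2) = segment
    return math.atan2(y2 - y1, x2 - x1) % math.pi


def parallel_pairs(segments, tol_deg=5.0):
    """Return index pairs of segments whose orientations differ by < tol."""
    tol = math.radians(tol_deg)
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(segments), 2):
        diff = abs(orientation(a) - orientation(b))
        diff = min(diff, math.pi - diff)  # orientations wrap around at pi
        if diff < tol:
            pairs.append((i, j))
    return pairs


segments = [
    ((0.0, 0.0), (10.0, 0.0)),   # horizontal
    ((0.0, 5.0), (10.0, 5.2)),   # nearly horizontal
    ((0.0, 0.0), (0.0, 8.0)),    # vertical
    ((10.0, 5.0), (0.0, 5.0)),   # horizontal, drawn right to left
]
print(parallel_pairs(segments))  # [(0, 1), (0, 3), (1, 3)]
```

A subsequent level would combine such pairs into closed figures such as parallelograms, again retaining competing hypotheses until enough information is available to choose among them.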

The properties that are used for grouping and for selecting
among possible groups are of key importance.
We believe that these properties should derive
from an analysis of expected invariances for
classes of objects under various imaging conditions.
For example, we know that surfaces of a rectangular
parallelepiped will project to parallelograms
under orthography (generally applicable
when the sensor is relatively far away compared to the
size of the object). Next, we show an example of
building detection using multiple images. We have
developed a system that combines matching and
grouping operations in a hierarchical fashion [Noronha
& Nevatia, 1997]. Figure 4 shows the results
of this procedure on a portion of the Ft. Hood
dataset using three separate views. This is one of the
most difficult parts of the Ft. Hood scene; trees obscure
some of the sides and buildings have roofs at
multiple heights. Note that there are no false alarms
and most of the buildings are detected and delineated
correctly.

Figure 4 Buildings detected by multi-view hierarchical system.

In past work, it has been common to use geometrical
properties. With the availability of multiple
sources, grouping will need to use features from
multiple  images  and  combine  them  depending  on 
the  sensor  characteristics.  Combinations  using  geo¬ 
metric  properties  will  be  easier  than  combining  sen¬ 
sor  level  data.  We  will  also  need  a  method  to  accu¬ 
mulate  evidence  of  objects  from  a  variety  of  uncer¬ 
tain  sources. 
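
The text leaves the evidence accumulation method open. As one possible illustration, and not the authors' method, the sketch below fuses per-source probabilities in log-odds form. It assumes the sources are conditionally independent and that each probability lies strictly between 0 and 1; the function name and the prior are ours.

```python
import math


def combine_independent(probabilities, prior=0.5):
    """Fuse per-source probabilities that an object is present.

    Assumes the sources are conditionally independent given the object
    and that each probability was computed under the same `prior`.
    All values must lie strictly between 0 and 1.
    """
    prior_logit = math.log(prior / (1.0 - prior))
    total = prior_logit
    for p in probabilities:
        total += math.log(p / (1.0 - p)) - prior_logit
    return 1.0 / (1.0 + math.exp(-total))


# Two sources that each weakly support a building hypothesis.
print(round(combine_independent([0.7, 0.8]), 3))  # 0.903
```

Agreeing sources reinforce each other, while a supporting and an equally strong contradicting source cancel; whether the independence assumption holds for real sensor combinations would need to be checked.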

d) Context and Domain Knowledge

Context  and  domain  knowledge  are  important 
sources  of  information  for  object  extraction.  Pres¬ 
ence  of  one  set  of  objects  can  help  reinforce  or  sug¬ 
gest  the  presence  of  others.  For  example,  in  an  air¬ 
port  complex,  the  hypotheses  for  an  airplane  and  a 
terminal  reinforce  each  other  if  proper  relations  ex¬ 
ist  between  them.  Extraction  of  a  certain  road,  or  a 
transportation network in general, helps in identification
of a certain building and in predicting locations
of other buildings. Extraction of a road and
river  provides  cues  to  the  location  of  a  bridge  and  so 
on. 

Domain  knowledge  also  helps  us  choose  the  tools 
that  are  appropriate  for  the  task  and  in  choosing  pa¬ 
rameters  or  rules  for  the  algorithms.  Different  set¬ 
tings  or  methods  may  be  appropriate  for  processing 
in  rural  or  urban  areas,  or  in  areas  with  or  without 
heavy  vegetation,  or  in  presence  or  absence  of 
snow. 

e)  Use  of  Previous  Models  and  Maps 

In  many  cases  of  interest,  previous  maps  and  mod¬ 
els  may  exist  and  the  task  may  be  one  of  updating 
or  extending  them.  In  this  case,  we  need  to  register 
the  new  images  with  the  existing  model,  find  the 
differences  between  the  two  and  update  the  models 
with  descriptions  of  the  new  and  changed  features. 
The  process  of  finding  the  differences  consists  of 
computing  the  expected  visible  features  (for  the 
current  imaging  environment)  and  verifying  wheth¬ 
er  these  features  are  present  in  the  image.  Regions 
of  significant  difference  can  invoke  the  model  con¬ 
struction  process.  Some  capabilities  of  image  to 
model  registration  and  model  validation  have  been 
developed as part of our RADIUS effort [Huertas
et al., 1995; Huertas & Nevatia, 1997].

f)  Human  Interaction 

Even  though  the  goal  of  this  effort  is  complete  au¬ 
tomation,  it  is  likely  that  the  systems  that  can  be  de¬ 
veloped  in  the  near  term  will  not  be  perfect  and  will 
miss  some  objects  or  find  incorrect  ones.  A  mecha¬ 
nism  is  needed  to  edit  and  correct  them.  This  should 
not  require  a  user  to  invoke  completely  manual  pro¬ 
cedures;  in  many  cases,  it  is  sufficient  for  the  user  to 
provide  some  hints  to  the  automatic  system  to  re¬ 
compute  and  correct  the  problems.  We  have  some 
experience  with  such  an  approach  where  in  some 
cases  a  missed  building  can  be  found  simply  by  the 
operator  indicating  the  approximate  location  of  the 
building  and  a  possible  cause  of  the  failure  [Heuel 
&  Nevatia  1996].  Sometimes,  more  precise  interac¬ 
tion  may  be  needed,  but  it  should  still  not  be  neces¬ 
sary  to  revert  to  a  complete  manual  system.  For  ex¬ 
ample,  if  the  size  of  the  roof  is  corrected  by  the  user, 
the  height  can  be  recomputed  automatically  using 
the  same  procedures  as  the  automatic  extraction 
system.  This  work  is  reported  on  in  more  detail  in 
[Huertas  &  Nevatia  1997]. 


3  Evaluation  Plan 

We  are  developing  relatively  complete,  end-to-end 
systems  that  start  with  images  (and  some  collateral 
data  when  available)  and  produce  3-D  object  mod¬ 
els.  This  makes  it  easier  to  establish  evaluation  met¬ 
rics  and  to  test  the  systems.  We  describe  some  met¬ 
rics  and  an  evaluation  methodology  below. 

3.1  Metrics 

The  following  metrics  capture  issues  in  evaluating 
extraction  results: 

1) Detection rate: How often are the desired features
detected? This can be in terms of the absolute
number of detected objects or may be by
some weighting (such as by size or by importance).
We consider a feature to be detected if
there is any overlap between a detected feature
and a desired feature (this could be modified to
require a certain amount of overlap or a certain
minimum accuracy of the model).

2)  False  Alarm  rate:  This  measures  the  frequen¬ 
cy  of  mistaken  detection.  A  feature  is  consid¬ 
ered  a  false  alarm  if  it  does  not  overlap  with  any 
desired  feature  (of  the  detected  class).  Again, 
the  rate  may  be  measured  in  terms  of  number  of 
objects  or  by  some  weighting. 

3) Accuracy of Models: This is more difficult to
measure. We can measure errors in 2-D or 3-D.
Typically we want to know the accuracy in
terms of size, shape and location. Size error can
be computed in terms of volumes or by other
parameters such as area and height. Shape differences
may be harder to characterize, except
perhaps by the amount of overlap (in 2-D or 3-D).
The error metric can be made specific to a
shape; for example, for a rectangular structure,
measurements of the three sides and the center
may suffice.

4)  Confidence  Factor:  We  expect  that  our  sys¬ 
tems  will  be  able  to  assign  confidence  factors  to 
the  detected  features  (and  even  to  components 
of  these  features  if  necessary).  These  could  be 
included  as  modifications  to  the  above  mea¬ 
sures.  For  example,  a  false  alarm  indicated  with 
lower  confidence  could  be  counted  as  being  less 
severe  than  one  with  a  high  confidence. 
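
The first two metrics can be sketched under the any-overlap rule stated above. The sketch assumes that features are represented by axis-aligned 2-D footprints (xmin, ymin, xmax, ymax) and that the false alarm rate is normalized by the number of detections; neither choice is fixed by the text, and the function names are illustrative.

```python
def overlaps(a, b):
    """True if axis-aligned boxes (xmin, ymin, xmax, ymax) share area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def detection_and_false_alarm_rates(detected, desired):
    """Return (detection_rate, false_alarm_rate) under the any-overlap rule.

    detection_rate: fraction of desired features overlapped by a detection.
    false_alarm_rate: fraction of detections overlapping no desired feature.
    """
    hit = sum(any(overlaps(d, g) for d in detected) for g in desired)
    false = sum(not any(overlaps(d, g) for g in desired) for d in detected)
    det_rate = hit / len(desired) if desired else 0.0
    fa_rate = false / len(detected) if detected else 0.0
    return det_rate, fa_rate


desired = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10)]
detected = [(1, 1, 9, 9), (22, 2, 33, 8), (60, 60, 70, 70)]
det_rate, fa_rate = detection_and_false_alarm_rates(detected, desired)
# Two of three desired features are hit; one of three detections is false.
```

Weighting by size, importance, or confidence factor, as suggested in items 1, 2 and 4, would replace the simple counts with weighted sums.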

3.2  Testing 

For evaluations to be meaningful, the system must
be tested on a wide variety of images that contain a
variety of desired objects in a variety of environments,
imaged under a variety of conditions
(possibly with a variety of sensor types). The results
need to be characterized as a function of these variables.

3.3  Demonstration  Plan 

Our demonstrations will consist of taking input images
and any available collateral data, displaying
the results of our APGD algorithms, and comparing
them, in some cases, with hand-provided
ground-truth  results.  We  expect  to  demonstrate  re¬ 
sults  on  a  variety  of  objects  using  a  variety  of  imag¬ 
ery  sources.  The  system  will  be  designed  to  run  au¬ 
tonomously  but  some  human  interaction,  either  to 
initiate  the  tasks,  or  to  edit  the  results  may  be  al¬ 
lowed. 

We  will  also  aid  in  integrating  our  system  with  the 
system  to  be  developed  by  the  APGD  IFD  contrac¬ 
tor  and  demonstrate  our  systems  in  a  larger  context. 
We  intend  to  develop  our  software  using  the  RCDE 
environment  which  should  simplify  integration 
with the IFD contractor.

Abstract

This  paper  presents  an  overview  of  the  pro¬ 
gram  of  research  at  the  CMU  Digital  Mapping 
Laboratory  in  the  analysis  of  remotely  sensed 
imagery  and  the  construction  of  virtual  world 
databases.  We  report  progress  in  the  areas  of 
digital  photogrammetry,  automatic  and  semi¬ 
automatic  building  extraction,  road  extraction, 
stereo  analysis,  multispectral/hyperspectral  im¬ 
age  analysis,  and  virtual  world  construction  for 
distributed  simulation. 

1  Introduction 

This  paper  presents  our  annual  overview  of  re¬ 
search  conducted  at  the  Digital  Mapping  Labo¬ 
ratory  (MAPSLab),  Computer  Science  Depart¬ 
ment,  Carnegie  Mellon  University.  The  MAP¬ 
SLab  is  supported  under  two  DARPA  research 
programs.  The  Rapid  Construction  of  Vir¬ 
tual  Worlds  (RCVW)  program  supported  by 
DARPA/ISO/SE,  USATEC,  NIMA/TMPO,  and  DMSO 
has  focused  on  the  end-to-end  construction  of 
visual  simulation  databases.  This  ranges  from 
cartographic  feature  extraction,  intensification 
of standard product spatial databases,
selection,  generalization,  and  integration 

Our  research  under  the  Automatic  Population 
of  Geospatial  Databases  (APGD)  program  sup¬ 
ported  by  DARPA/ISO/IU  has  a  much  narrower 
scope  of  effort.  Here  we  are  addressing  research 
toward automating cartographic feature attribution
using high spatial resolution hyperspectral
imagery, in combination with our existing
cartographic feature extraction (CFE) systems
running on panchromatic imagery.

The Digital Mapping Laboratory's WWW Home Page
may be found at: http://www.cs.cmu.edu/~MAPSLab

In  this  section  we  briefly  overview  the  research 
activities conducted by the MAPSLab in each of
these  related  research  areas. 

1.1  RCVW  Research 

The  purpose  of  our  RCVW  contract  is  to  con¬ 
duct  a  program  of  basic  research  in  image  un¬ 
derstanding  and  automated  cartography  to  sup¬ 
port  the  efficient  representation  and  rapid  con¬ 
struction  of  virtual  world  databases.  Such 
a  system  would  take  available  panchromatic 
and  multispectral/hyperspectral  imagery,  exist¬ 
ing  map  and  digital  elevation  models,  knowl¬ 
edge  about  particular  environmental  and  geo¬ 
graphic  conditions  and  automatically  produce  a 
virtual  world  database  tailored  to  a  particular 
simulation requirement. While this is currently far
beyond the state-of-the-art, the research that we
proposed addresses several issues which, when
solved, will bring us closer to the goal of low
cost construction of virtual world databases in
a timely manner, produced according to standards
that allow for sharing and interoperability.

Our activities are focused on three related
research areas: efficient representation of
virtual worlds, spatial database intensification to
provide improved timeliness and level-of-detail
for virtual world database construction, and
the analysis of multispectral imagery for
surface material characterization. Our research in
cartographic feature extraction to support
geospatial database intensification for automated
simulation database construction is described in
Sections 2, 3, and 4.

Section 6 gives a brief overview of our work in
virtual world database construction. Progress
in efficient representation and cartographic data
processing is described in Section 6.1, while
examples of real-world database construction used
for actual exercises are described in Section 6.2.

1.2  APGD  Research 

The purpose of our APGD research is to
investigate how the use of high-resolution
hyperspectral imagery will enable more detailed and
accurate surface material attributions for simulation
databases, especially in complex urban areas.
Fusion  of  spectral  information  and  derived  sur¬ 
face  materials  with  existing  building  and  road 
extraction  systems  will  greatly  improve  both 
the  performance  of  such  systems,  by  enabling 
hypothesis  verification  based  on  material  type, 
and  the  cartographic  utility  of  their  output,  by 
the  addition  of  semantic  attributions  such  as 
material  type. 

The  focus  of  most  cartographic  feature  extrac¬ 
tion  activity  is  the  production  of  geo-spatial 
geometric  descriptions  of  manmade  structures 
and  of  the  natural  terrain.  Under  the  APGD
project  we  will  investigate  the  automated  ex¬ 
traction  of  semantic  attribution  information  for 
these  features  by  the  fusion  of  hyperspectral  and 
panchromatic  imagery.  While  our  ongoing  re¬ 
search  in  RCVW  with  panchromatic  imagery 
has  focused  on  the  geometric  aspects  of  carto¬ 
graphic  feature  extraction,  the  generation  of  de¬ 
tailed  surface  material  maps  as  well  as  the  attri¬ 
bution  of  the  composition  of  man-made  objects 
has  not  until  now  been  the  subject  of  detailed 
analysis.  Section  5  describes  the  acquisition  of 
the  HYDICE  data,  previously  funded  under  the 
RCVW  program,  and  the  ongoing  research  in 
highly accurate geometric registration and
radiometric correction that we believe are required
for  high  quality  information  fusion. 

2  Automated  and  Semi-Automated 
Building  Extraction 

Building  extraction  from  aerial  images  contin¬ 
ues  to  be  a  major  component  of  our  research, 
and  the  ability  to  relieve  or  eliminate  the  man¬ 
ual  burden  of  building  model  compilation  be¬ 
comes  increasingly  important  in  light  of  the 
need to populate simulation databases with
geospecific building models. Our research
philosophy for the building extraction problem has
been, and continues to be, predicated on three
factors: the use of rigorous photogrammetric
camera models to model image and scene
geometry, the assertion that multiple cooperative
analysis techniques are necessary to achieve
robust performance on a wide variety of
complex urban and suburban scenes, and the
thorough performance evaluation and analysis of
our systems [McKeown, 1996; McKeown et al.,
1996c].

In  previous  work  [McGlone  and  Shufelt,  1994], 
we  showed  that  photogrammetric  knowledge 
could  serve  a  key  role  in  increasing  the  perfor¬ 
mance  of  feature  extraction  systems.  In  Sec¬ 
tion  2.1,  we  discuss  PIVOT,  a  new  monocu¬ 
lar  building  extraction  system  completed  dur¬ 
ing  the  past  year,  which  uses  photogrammet¬ 
ric  information  at  every  phase  of  its  processing 
to  achieve  increased  performance  on  complex 
and  dense  scenes.  In  Section  2.2,  we  discuss 
improvements  and  modifications  to  SiteCity, 
a  semi-automated  multi-image  building  extrac¬ 
tion  system,  and  show  examples  of  site  mod¬ 
els  generated  with  this  system.  The  results 
obtained  from  both  the  automated  and  semi- 
automated  systems  emphasize  the  importance 
of  using  rigorous  camera  models  for  feature  ex¬ 
traction  [McKeown  and  McGlone,  1993]. 

We  have  made  substantial  progress  in  perfor¬ 
mance  evaluation  and  analysis  of  our  systems. 
In  Section  2.3,  we  describe  the  methodology 
for  2D  image  space  and  3D  object  space  per¬ 
formance  evaluation  of  our  automated  building 
extraction  systems  on  83  test  images  covering 
18  sites,  as  well  as  a  more  recent  feature  ex¬ 
traction  experiment  for  the  Rapid  Construction 
of  Virtual  Worlds  (RCVW)  research  program, 
which  evaluated  performance  of  our  automated 
and  semi-automated  systems  on  26  test  images 
covering  4  sites  over  Fort  Hood,  Texas. 

Our  research  in  building  extraction  has  followed 
the  cooperative  methods  paradigm  [Shufelt  and 
McKeown,  1993],  which  states  that  no  one  tech¬ 
nique  will  be  sufficient  to  handle  the  variety 
of  complex  scenes  that  will  be  encountered  in 
the  real  world.  By  developing  systems  with  a 
variety  of  different  error  modalities  and  scene 
analysis  techniques,  and  merging  the  results,  it 
is  possible  to  achieve  robust  performance  on  a 
wider set of imagery. Recent work in the fusion
of monocular building models with dense
elevation maps derived from stereo matching
represents one application of the cooperative
methods approach. Preliminary results of that work
are described in Section 4.

2.1  Developments  in  automated 
building  extraction 

Traditionally,  computer  vision  techniques  for 
automated  building  extraction  have  neglected 
the  use  of  photogrammetric  camera  modeling  as 
a  source  of  geometric  information,  instead  treat¬ 
ing  the  image  as  the  sole  source  of  information. 
This  limited  view  of  the  problem  has  forced 
both  region-based  and  feature-based  techniques 
to  make  strict  assumptions  about  image  geome¬ 
try  and  scene  content;  consequently,  such  meth¬ 
ods  exhibit  poor  performance  on  imagery  where 
buildings  are  not  easily  segmented  by  inten¬ 
sity  criteria  alone,  or  where  complex  shapes  are 
prevalent  and  oblique  viewing  angles  violate  as¬ 
sumptions  about  image  acquisition  geometry. 

PIVOT (Perspective Interpretation of
Vanishing points for Objects in Three
dimensions) is a new fully automated
monocular building extraction system [Shufelt, 1996a;
Shufelt, 1996c]. A major distinction between
PIVOT  and  the  systems  preceding  it  is  the 
thorough  integration  of  photogrammetric  mod¬ 
eling  in  all  phases  of  the  building  extraction 
process.  Another  important  distinction  is  the 
ability  to  model  buildings  as  combinations 
of  primitives,  simple  volumetric  models  of 
fixed  shape  but  variable  size  [Biederman, 
1985],  which  can  be  used  to  represent  complex 
structures  without  the  need  for  a  large  model 
library. 

PIVOT has five major steps in its bottom-up
processing flow:

1.  Vanishing  point  detection  is  performed  after 
preprocessing  phases  of  edge  detection  and 
recursive  line  fitting.  This  process  makes  use 
of the camera model to locate dominant horizontal
and “peak” roof orientations in the
image  features,  and  has  been  described  in  de¬ 
tail  elsewhere  [Shufelt,  1996b].  The  result  of 
this  process  is  a  set  of  line  segments  labeled 
with  plausible  3D  object  space  orientations. 

2.  Intermediate  feature  generation  uses  the 
orientation-labeled  line  segments  to  construct 
corners and 2-corners (pairs of corners which
share a segment). Each feature is subjected
to  a  geometric  consistency  test  which  com¬ 
pares  each  possible  orientation  labeling  with 
the  orientation  labels  of  the  primitives.  Only 
those  features  which  have  at  least  one  legal  la¬ 
beling  are  accepted  for  further  analysis.  This 
process  is  also  described  in  detail  elsewhere 
[Shufelt,  1997]. 

3.  Primitive  generation  uses  the  orientation- 
labeled  intermediate  features  to  perform  a 
local  search  for  plausible  instances  of  primi¬ 
tives.  These  primitive  instances  are  then  sub¬ 
jected  to  evaluation  for  image  support,  and 
the  best  scoring  primitive  for  each  2-corner 
is  retained  for  further  analysis. 

4.  Primitive  combination  and  extension  proce¬ 
dures  are  responsible  for  joining  primitives 
together  to  form  more  complex  objects,  and 
for  using  primitives  as  starting  points  for 
searches  for  other  primitives.  One  of  the  key 
ideas  behind  the  use  of  primitives  for  3D  ob¬ 
ject  modeling  is  that  they  represent  a  natu¬ 
ral  vocabulary  for  expressing  complex  shapes, 
eliminating  the  need  for  exhaustive  represen¬ 
tation  of  each  unique  shape  that  occurs  in  a 
scene. 

5.  Building  model  verification  employs  several 
tests  to  evaluate  the  final  set  of  combined  and 
extended  primitives.  The  evaluation  consid¬ 
ers  shadow  support,  photometric  consistency 
across  adjacent  surfaces,  edge  support  in  the 
image,  surface  intensity  homogeneity,  and  ob¬ 
ject  space  dimensions  and  geometry.  It  is 
worth  noting  that  the  shadow  modeling  is 
done  in  object  space,  unlike  many  other  sys¬ 
tems  which  use  image  space  approximations. 
PIVOT  also  makes  use  of  a  new  illumination 
constraint  based  on  the  known  solar  azimuth 
and  elevation  parameters  to  evaluate  the  con¬ 
sistency  of  surface  intensity  changes.  The  re¬ 
sult  of  building  model  verification  is  a  set  of 
3D  wireframe  building  models  in  geodetic  co¬ 
ordinates. 
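
Step 5 above relies on shadow modeling in object space using the known solar azimuth and elevation. The geometry behind such a test can be sketched as follows. This is a minimal illustration and not PIVOT's implementation: the function names, the flat-ground assumption, and the axis and azimuth conventions are all assumptions made for the example.

```python
import math


def shadow_tip(x, y, height, sun_azimuth_deg, sun_elevation_deg):
    """Return the ground point where the shadow of an elevated point falls.

    Assumed conventions (illustrative, not taken from PIVOT): flat ground,
    x points east, y points north, and azimuth is measured clockwise from
    north toward the sun.
    """
    if not 0.0 < sun_elevation_deg <= 90.0:
        raise ValueError("sun elevation must be in (0, 90] degrees")
    # A point at `height` casts a shadow of length height / tan(elevation).
    length = height / math.tan(math.radians(sun_elevation_deg))
    az = math.radians(sun_azimuth_deg)
    # The shadow extends away from the sun, hence the minus signs.
    return (x - length * math.sin(az), y - length * math.cos(az))


def roof_shadow_outline(roof_corners, sun_azimuth_deg, sun_elevation_deg):
    """Project each (x, y, height) roof corner to its ground shadow point."""
    return [shadow_tip(x, y, h, sun_azimuth_deg, sun_elevation_deg)
            for x, y, h in roof_corners]
```

Under these assumptions, the projected outline could be mapped into the image through the camera model and compared against dark regions, which is one way a hypothesized building could gain or lose shadow support.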

We briefly show the results from a PIVOT
run, contrasted with the results of other
building extractors developed at the Digital
Mapping Laboratory in earlier research efforts.
Figure 1a shows an image of a complex multi-level
structure from Fort Hood, Texas. Figures
1b-1d show the automated building extraction
results from BUILD+SHAVE, VHBUILD,
and PIVOT, respectively. In this case, PIVOT
is able to take advantage of its use of
primitives in conjunction with a rigorous camera
model to better delineate the 3D structure of
the complex building, albeit with a higher rate
of false positives. This contrasts with
VHBUILD, which does not perform full object
space modeling of shadows for verification, and
with BUILD+SHAVE, which does not use
camera geometry for hypothesis generation or
verification.

(c) VHBUILD results, COMPLEX.FHN717. (d) PIVOT results, COMPLEX.FHN717.

Figure 1: Automated building extraction results for a Fort Hood test scene.

While PIVOT can achieve impressive
performance on several test scenes with no manual
intervention, it is still the case that typical
results from fully automated systems require
manual editing before they are suitable for inclusion
in a simulation database. The following section
discusses recent progress in SiteCity, our
semi-automated site modeling system.

2.2  Developments  in  semi-automated 
building  extraction 

SiteCity is a semi-automated multi-image site
modeling system developed at the Digital
Mapping Laboratory. SiteCity combines
interactive measurement tools, image understanding
techniques, and a rigorous constrained
least-squares photogrammetric bundle adjustment
package [McGlone, 1995] for site model
generation. SiteCity models several generic classes of
buildings with semi-automated tools, known as
automated processes, which use IU techniques
to alleviate the burden of manual measurement
of structures in multiple images. Thorough
descriptions of this modeling system and detailed
usability evaluations have appeared elsewhere
[Hsieh, 1996a; Hsieh, 1996b]; in this section, we
focus on recent developments and additions to
SiteCity.

(a) Overhang peaked roof buildings. (b) Creating a rotated copy.

Figure 2: An example of SiteCity’s automated copy with rotate process.

2.2.1  Model  generation  capabilities 

SiteCity provides several tools for creating new
building models with IU assistance as well as
tools  for  constraining  points,  lines,  and  surfaces 
together  to  obtain  models  with  correct  geom¬ 
etry  during  the  multi-image  triangulation  pro¬ 
cess.  Recently,  additions  were  made  to  each  of 
these  toolsets  to  improve  the  ability  to  measure 
and  refine  building  models. 

The  original  version  of  SiteCity  contained  a 
set  of  object  constraint  tools,  which  allowed 
a  user  to  specify  coplanarity  and  collinearity 
constraints  among  point  sets,  lines,  and  sur¬ 
faces  of  building  models.  In  addition  to  aligning 
building  models,  these  tools  were  also  useful  for 
stacking  buildings  to  form  complex  multi-level 
structures  as  well  as  joining  walls  of  adjacent 
structures.  SiteCity  now  has  new  constraint 
tools  for  coplanarity  and  collinearity,  which  al¬ 
low  the  planes  and  lines  to  be  constrained  to 
horizontal  and  vertical  orientations.  These  tools 
are  useful  for  ensuring  that  constrained  objects 
are  not  only  tied  together,  but  also  lie  in  specific 
object space orientations. Work is underway to
implement faster constraint solutions and more
flexible constraint specifications.

SiteCity  provided  a  number  of  automated  pro¬ 
cesses,  tools  which  could  be  invoked  by  the  user 
to  carry  out  certain  measurement  tasks  auto¬ 
matically.  One  of  these  tasks,  automated  copy, 
provides  a  simple  mechanism  for  repeatedly  in¬ 
stantiating  specific  models  at  particular  points, 
along  a  line,  or  within  a  polygon  in  the  image; 
after  the  point,  line,  or  polygon  has  been  spec¬ 
ified,  SiteCity  automatically  finds  instances  of 
the  model  in  the  vicinity  of  these  copy  regions 
and  performs  multi-image  matching.  However, 
the  copy  tool  only  performed  rigid  translations 
of the model to be copied, which meant that
rotated versions of the same model had to be
measured separately. A new addition to SiteCity is
the  automated  copy  with  rotate  tool,  which  al¬ 
lows  the  user  to  specify  a  rotation  angle  in  the 
xy-plane  with  a  copy  region.  SiteCity  then  looks 
for  instances  of  the  model  in  the  copy  region 
with  the  desired  rotation  angle. 
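
The geometric part of copy with rotate can be sketched as a rotation of the model's vertices about a vertical axis. This is a minimal illustration under stated assumptions, not SiteCity code: the function name, the vertex representation as (x, y, z) tuples, and the choice of the footprint centroid as the default pivot are all hypothetical, and the multi-image matching that SiteCity performs afterwards is not shown.

```python
import math


def rotate_model_xy(vertices, angle_deg, pivot=None):
    """Rotate (x, y, z) vertices by angle_deg in the xy-plane.

    The rotation is counterclockwise about a vertical axis through
    `pivot`, an (x, y) pair. When no pivot is given, the centroid of the
    vertices in the xy-plane is used (an assumption for this sketch).
    Heights (z values) are left unchanged.
    """
    if pivot is None:
        n = len(vertices)
        pivot = (sum(v[0] for v in vertices) / n,
                 sum(v[1] for v in vertices) / n)
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    rotated = []
    for x, y, z in vertices:
        dx, dy = x - pivot[0], y - pivot[1]
        rotated.append((pivot[0] + c * dx - s * dy,
                        pivot[1] + s * dx + c * dy,
                        z))
    return rotated
```

A rotated copy produced this way would only be an initial placement; the text above indicates that SiteCity then searches the copy region for image evidence that matches the model at the requested angle.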

Figure  2a  shows  several  peaked  roof  buildings 
with  overhanging  roofs,  of  which  all  but  one 
share  the  same  orientation.  The  remaining 
building  is  identical  to  the  others  except  for 
a  rotation  of  90  degrees,  and  in  previous  ver¬ 
sions  of  SiteCity,  measurement  of  this  rotated 
building  would  have  had  to  start  from  scratch. 
Automated copy with rotate now allows a user
to simply specify a rotation angle and click on
the building, and SiteCity performs the
rotation and does matching to position the rotated
copy of the building. The result of this process
is depicted in Figure 2b. This feature extends
the range of situations in which SiteCity can be
used to quickly generate new ground truth sites.

(a) STEW scene. (b) LEG02 scene.

Figure 3: Views of SiteCity models in Fort Hood simulation database.

2.2.2  User  interface  issues 

Each  building  model  measured  in  SiteCity  is 
represented  as  a  hierarchical  3D  wireframe  ob¬ 
ject.  A  building  model  can  be  queried  at  the  vol¬ 
ume,  surface,  line,  or  point  level,  as  appropriate. 
Measurements can also be adjusted at different
levels in the hierarchy, which gives added
flexibility in positioning model points and
boundaries. However, the original SiteCity interface
required the user to select among the volume,
surface,  line,  and  point  levels  before  any  queries 
or  adjustments  could  be  done.  SiteCity  now 
incorporates  a  2D  selection  mechanism  which 
allows  objects  to  be  selected  in  point-and-click 
fashion,  without  the  need  to  explicitly  specify 
the  desired  type  of  geometric  object  to  be  ac¬ 
cessed. 

While SiteCity has been used to generate
the ground-truth models used for simulation
database population as well as automated
building extraction evaluation, many of the IU
algorithms incorporated in the system remain
active topics of research. For example, the
multi-image matching routine which SiteCity
uses to position a model from one image in the
other images continues to undergo
experimentation. To provide better support for these
experiments, SiteCity now provides a display tool
to show which edge segments and corners were
used by SiteCity to choose a particular
building match. These display diagnostics are useful
in determining where particular matching
techniques succeed or fail, and how they might be
improved.

2.2.3  Ground  truth  model 
construction 

As  noted  earlier,  SiteCity  has  been  used  to  con¬ 
struct  a  wide  variety  of  complex  building  mod¬ 
els  for  use  in  evaluating  the  monocular  building 
extraction  systems,  as  well  as  populating  sim¬ 
ulation  databases.  To  support  experiments  un¬ 
der  the  APGD  and  RCVW  research  initiatives, 
SiteCity  has  been  used  to  generate  ground  truth 
building  models  for  several  test  areas  in  aerial 
imagery  covering  Fort  Hood,  Texas.  The  site 
models  for  these  areas  are  representative  of  the 
modeling  capabilities  of  SiteCity. 

Figure 3a shows a view of the STEW test scene,
which is dominated by a large multi-level
complex structure in the center of the scene. Note
that the structure is not strictly rectilinear; the
left side of the structure has an angled wing.
Also, the central receiving area of the building
has a curved overhanging parking roof with a
hole where the roof joins the main body of the
building.

Figure  3b  shows  a  view  of  the  LEG02  test  scene, 
which  consists  of  several  clusters  of  rectilinear 
buildings  with  complex  perimeters  and  multi¬ 
ple  holes.  SiteCity  models  these  structures  by 
constraining  together  multiple  instances  of  the 
“lego”  block  shape,  which  is  only  measured  once 
and  then  copied  as  necessary.  This  approach 
allows  for  fast  semi-automated  measurement  of 
these  structures,  but  also  places  heavy  demands 
on  the  constraint  solution  package.  We  are 
currently  working  to  optimize  the  constrained 
bundle  adjustment  solution,  a  very  necessary 
step  given  the  size  of  the  problems  addressed. 
For  instance,  the  LEG02  site  has  seven  sets  of 
buildings,  measured  on  seven  images.  Each  set 
of  buildings  was  solved  separately;  four  of  the 
sets  had  2462  parameters  (448  points  and  717 
constraints),  while  three  had  1874  parameters 
(336  points  and  542  constraints).  Fortunately, 
the  predictable,  sparse  structure  of  the  normal 
equation  matrix  lends  itself  to  efficient  forma¬ 
tion,  reduction,  and  solution  techniques.  We 
are  also  re-implementing  constraints  and  build¬ 
ing  models  in  more  efficient  ways,  and  working 
on  ways  to  generate  better  approximations  for 
the  parameters  so  that  fewer  iterations  are  re¬ 
quired. 

2.3  Performance  evaluation  and 
analysis 

There  has  been  a  relative  lack  of  useful  perfor¬ 
mance  evaluation  in  previous  research  in  object 
detection  and  delineation  in  aerial  imagery.  Fre¬ 
quently,  the  output  of  building  extraction  sys¬ 
tems  is  only  presented  graphically,  without  re¬ 
course  to  quantitative  evaluation  metrics,  and 
in  cases  where  the  scene  models  are  compared 
with  ground  truth  models,  the  metrics  used  give 
a  biased  or  unclear  measure  of  system  perfor¬ 
mance.  Although  some  work  in  the  literature 
has  been  accompanied  by  thorough  and  rigor¬ 
ous  performance  evaluation,  this  has  been  the 
exception  rather  than  the  rule. 

One of the main thrusts of our research in
building extraction, and our broader efforts in
cartographic feature extraction, is the extensive
application of rigorous performance evaluation
and analysis of our research systems. We
believe that detailed performance analysis using a
wide range of unbiased metrics is crucial to
understanding system behavior, and we have
developed several such metrics during the course
of our work in building extraction.

In our most comprehensive performance
evaluation of building extraction techniques to date,
PIVOT was compared to three of our existing
monocular building extraction systems, BUILD,
BUILD+SHAVE, and VHBUILD, on 83 test
images covering 18 sites, with widely varying scene
content. The results of that study showed that
each system achieves its best performance in
different areas of the scene complexity space,
supporting the cooperative methods approach for
feature extraction. In particular, PIVOT was
shown to have improved performance in scenes
with high object density, high object
complexity, and high image obliquity, consistent with
the observations that PIVOT uses a richer
representation of building structure and employs
a rigorous photogrammetric camera model. A
detailed description of this evaluation can be
found elsewhere [Shufelt, 1996c].

In  more  recent  work,  we  designed  and  dissemi¬ 
nated  a  feature  extraction  study  guide  as  part 
of  the  RCVW  research  initiative,  for  presen¬ 
tation  at  the  Terrain  Week  ’97  meetings.  We 
evaluated  SiteCity  and  our  four  automated  sys¬ 
tems  on  the  four  new  sites  (26  test  images) 
described in that document. This test
document, and the briefing charts describing our
results for this study, are available on our
website (http://www.cs.cmu.edu/~MAPSLab).
The systems were evaluated along several
dimensions:

•  Time  required  to  extract  features,  broken 
down  by  feature  extraction  processing  phases 
and  manual/automated  phases 

•  Space  requirements  for  feature  extraction  al¬ 
gorithms 

•  Level  of  geometric,  topologic,  and  attribute 
information  extracted 

•  2D  building  extraction  percentage,  branching 
factor,  and  quality  percentage,  expressed  in 
terms  of  true/false  positive/negative  classifi¬ 
cation  of  image  (pixel)  space 

•  3D building extraction percentage, branching
factor, and quality percentage, expressed in
terms of true/false positive/negative
classification of object (voxel) space

•  Performance as a function of image obliquity
(nadir vs. low oblique vs. high oblique)

•  Performance  as  a  function  of  building  com¬ 
plexity  (image  space  and  object  space) 

•  Performance  as  a  function  of  building  density 
(image  space  and  object  space) 
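
The extraction percentage, branching factor, and quality percentage listed above can be computed from true/false positive/negative counts. The sketch below uses the definitions commonly given for these measures in the building extraction evaluation literature; this section does not state the formulas, so they should be read as an assumption, and the function names are hypothetical. The same counts apply whether the cells are image pixels (2D) or object-space voxels (3D).

```python
def confusion_counts(truth, result):
    """Count TP, FP, FN, TN over paired building/background labels.

    `truth` and `result` are equal-length sequences of truthy/falsy
    values, one per pixel (2D) or voxel (3D), flattened in the same order.
    """
    tp = fp = fn = tn = 0
    for t, r in zip(truth, result):
        if t and r:
            tp += 1      # building in both ground truth and result
        elif r:
            fp += 1      # result claims building, ground truth does not
        elif t:
            fn += 1      # building missed by the result
        else:
            tn += 1      # background in both
    return tp, fp, fn, tn


def extraction_metrics(tp, fp, fn):
    """Return the three measures; None marks an undefined ratio."""
    return {
        # Share of ground-truth building cells that were found.
        "extraction_pct": 100.0 * tp / (tp + fn) if tp + fn else None,
        # False positives per true positive (lower is better).
        "branching_factor": fp / tp if tp else None,
        # Overall measure penalizing both misses and false alarms.
        "quality_pct": 100.0 * tp / (tp + fp + fn) if tp + fp + fn else None,
    }
```

Note that true negatives do not enter any of the three measures, so under these definitions a large background area does not inflate the scores.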

As in the previous study, we again observed
that different methods performed well in various
parts of the scene complexity space. For many
of the test images used in the Terrain Week
report, we found that BUILD+SHAVE achieved
the best results, due to the prevalence of nadir
views. VHBUILD and PIVOT typically achieve
better performance in oblique views, when more
3D building structure is explicitly available in
low-level edge information.

These  studies,  and  other  related  performance 
evaluation  tasks  we  have  undertaken  in  the 
past  [Shufelt  and  McKeown,  1993;  McGlone 
and  Shufelt,  1994],  serve  as  examples  of  thor¬ 
ough  analysis  to  guide  research  into  feature 
extraction  techniques,  supported  by  intensive 
compilation  of  ground  truth  models.  Figure 
4  depicts  a  portion  of  a  simulation  database 
covering  Fort  Hood,  with  over  240  km  of  road
data  compiled  semi-automatically  by  techniques 
described  in  Section  3,  and  14  building  test 
sites  containing  241  buildings  compiled  with 
SiteCity,  ranging  from  simple  flat  rectilinear 
structures  to  multi-level  complex  buildings  with 
internal  holes.  While  the  compilation  of  such 
ground  truth  databases  remains  a  time  consum¬ 
ing  process,  we  feel  that  ground  truth  construc¬ 
tion  constitutes  an  essential  part  of  cartographic 
feature  extraction  research. 

3  Road  Network  Extraction 

Our previous work in automated road network
extraction has explored the use of cooperative
methods for robust feature extraction from
high resolution aerial imagery [McKeown and
Denlinger, 1988; Zlotnick and Carnine, 1993].
We continue to build on this work,
investigating the use of scalable, object-space
methods for automated and semi-automated road
network extraction. The problem of detecting
and delineating road features is difficult
because roads are more than linear features.
At spatial resolutions with ground sample
distances of 3 meters or better, roads are complex,
man-made structures composed of lanes (and
possibly lane markings), medians, shoulders,
sidewalks, overpasses and underpasses, width/lane
changes, etc. Road features can be partially
or completely obstructed by vehicles,
buildings, and/or overhanging vegetation.
Extraction techniques should take advantage of the
road structure available at high resolution, but
they must also be scalable if they are to be
tractable for use in constructing large-scale
feature databases. The use of object-space based
methods provides opportunities to use
real-world knowledge about road design structure,
as well as to factor in knowledge about the
parameters of data acquisition (e.g., camera angle, sun
angle, time of day, etc.).

In  this  section,  we  present  some  of  our  recent
work  in  automated  and  semi-automated  road 
network  extraction.  In  Sections  3.1  and  3.2  we 
describe  recent  improvements  to  road  feature 
representation,  the  interactive  extraction  sys¬ 
tem,  and  discuss  its  performance  evaluation.  In 
Section  3.3  we  present  a  new  model,  with  pre¬ 
liminary  results,  for  object-space  road  tracking. 

3.1  Interactive  road  extraction 

We  are  continuing  to  develop  our  research  in 
semi-automated  methods  for  road  feature  ex¬ 
traction.  Our  interactive  road  network  extrac¬ 
tion  system,  called  IdLWoof,  uses  our  auto¬ 
mated  road  extraction  system  to  augment  the 
manual  extraction  of  road  networks  from  aerial 
images. The user interface is built on our X11-based
image display library (IDL). As briefly
described  in  [McKeown  et  al.,  1996a],  IdLWoof 
allows  the  user  to  direct  execution  of  the  auto¬ 
mated  road  finding  and  tracking  processes,  as 
well  as  permitting  manual  delineation  and  edit¬ 
ing  of  road  segments.  We  use  IdLWoof  both  as 
a  test  bed  for  experimental  automated  methods, 
and  as  a  modeling  tool  for  manually  producing 
accurate  road  network  ground  truth. 

Our recent work has focused on improvements
to our underlying feature representations. The
2D feature representation and output formats
have been extended to support 3D objects.
Intersection models are now supported, and the
user interface has been augmented to support
the creation of these models. The
representation contains hooks for other feature models,
such as parking lots, bridges and overpasses,
and asymmetric roads. We have also added
support for feature attributions, so that road types,
names, etc., can be carried along with the
feature models. Annotated, 3D road network data
from IdLWoof was used to generate the scenes
from the visual simulation database shown in
Figure 4. Future work will concentrate on
integrating these new feature models into the
existing user interface.

Figure 4: Fort Hood simulation database.

Using IdLWoof to compile road network ground
truth is useful for several reasons. Extensive use
of the interface helps us to identify and improve
awkward or inefficient user interactions, as well
as uncovering bugs in the representations. The
generated ground truth data is used in the
performance analysis of our automatic road
network extraction system. Figure 5 shows an
orthographic view of road and building ground
truth databases compiled over the motor pool
area of Fort Hood, Texas. Over 240 km of road
data was compiled from five separate 2.5 km
square images (approximately 7600 x 7700
pixels). The road networks from each image were
joined (in object space), then merged with the
building data, and finally overlaid on an
orthophoto mosaic of the five source images. This
result shows some of the detail and the extent
of the road models one is able to produce with
IdLWoof. It also demonstrates the power of
using rigorous photogrammetric and object-space
representations.

Figure 5: An orthographic view of road and building ground truth databases compiled over the
motor pool area of Fort Hood, Texas. The feature data is overlaid on an orthophoto
mosaic of five nadir source images.

3.2  Performance evaluation and
analysis

The quantitative evaluation of a system’s
performance has remained a central theme in our
feature extraction research. Though
qualitative analysis is useful, it is fraught with
subjective biases and, most times, overlooks the
nuances that are often present in automated
results. Using quantitative measures for
evaluation permits a reliable determination of
measurable gains/failures, thus determining the real
utility of system modifications.

The basic method is to compare results against
a manually compiled, high quality ground truth.
This ground truth can come from a system such
as IdLWoof, or it could be an existing
cartographic product, such as DTOP data.
Comparisons against this ground truth should
generate measurements that reflect the quality of
the output, and lend some insight into where
system improvements are most needed.

We used our existing automated road tracking
system  to  generate  results  that  we  compared 
against  our  ground  truth,  shown  previously  in 
Figure  5.  The  results  were  generated  by  man¬ 
ually  selecting  a  set  of  road  starting  points, 
then  providing  those  points  as  input  to  our  au¬ 
tomated  road  tracker.  Using  manual  starting 
points  decouples  the  errors  introduced  by  auto¬ 
mated  road  finding  from  those  introduced  dur¬ 
ing  tracking,  permitting  evaluation  of  road  de¬ 
lineation  independent  of  detection  errors. 

Table 1 presents the performance results of this
comparison. The two data sets are compared
in 2D to determine overlaps. For pixel-based
measures (like redundancy), we create masks
from both the ground truth data and the
automated results and then compare the
overlapping regions. For feature-based measures, we
create a mask from the ground truth data, then
compare individual road features against that
mask. If a feature overlaps the ground truth
mask by more than a given threshold (typically
set to 75%), then that feature is considered a
correct hypothesis. As with the building
extraction performance evaluation, a confusion matrix
is compiled, consisting of true-positives (TP),
true-negatives (TN), false-positives (FP), and
false-negatives (FN). From these numbers,
several performance measures are computed:

•  Branching  factor  (FP/TP)  —  Measures  the 
number  of  competing  hypotheses. 

• Quality factor (TP/(TP+FP+FN)) — Measures
the overall quality of the result.

•  %  GT  Explained  —  The  percentage  amount 
of  ground-truth  that  is  covered  by  the  result. 

• % Correct Hypotheses — The percentage of
generated hypotheses that are considered
correct.

•  %  Redundancy  —  The  percentage  of  the  gen¬ 
erated  output  that  overlaps  itself.  This  is  use¬ 
ful  for  determining  how  much  extra  work  is 
being  done. 

It  is  important  to  use  several  performance  mea¬ 
sures  in  concert  to  obtain  an  accurate  evalua¬ 
tion.  For  example,  one  could  implement  a  clas¬ 
sification  method  that  labels  all  pixels  in  the  in¬ 
put  as  roads,  thus  yielding  a  %  GT  explained  of 
100%. However, the branching factor would be
extremely high, indicating that there is a problem
with the optimality of the classifier.
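
The measures above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for Table 1: the function names are hypothetical, masks are modeled as sets of (row, column) pixels, and the formulas for % GT explained (TP/(TP+FN)) and % correct hypotheses (TP/(TP+FP)) are assumed readings of the definitions in the list. Redundancy is omitted because it needs the overlap of the output with itself.

```python
def feature_is_correct(feature_pixels, gt_mask, threshold=0.75):
    """Return True if the fraction of a feature's pixels that fall on the
    ground-truth mask exceeds the threshold (75% in the text)."""
    feature_pixels = set(feature_pixels)
    if not feature_pixels:
        return False
    overlap = len(feature_pixels & set(gt_mask)) / len(feature_pixels)
    return overlap > threshold


def performance_measures(tp, fp, fn):
    """Compute measures from confusion-matrix counts.

    The two percentage formulas are assumptions, not quoted from the paper.
    """
    return {
        # FP/TP: competing (incorrect) output per correct unit of output.
        "branching_factor": fp / tp if tp else float("inf"),
        # TP/(TP+FP+FN): overall quality of the result.
        "quality_factor": tp / (tp + fp + fn),
        # Assumed: share of the ground truth covered by the result.
        "pct_gt_explained": 100.0 * tp / (tp + fn),
        # Assumed: share of the generated output that is correct.
        "pct_correct": 100.0 * tp / (tp + fp),
    }


# The degenerate classifier from the text: every pixel of a 100 x 100
# image is labeled "road" when only 100 pixels are truly road.
all_road = performance_measures(tp=100, fp=9900, fn=0)
```

For this degenerate case the ground truth is fully explained, yet the branching factor is 99 and the quality factor is 0.01, which is why the measures have to be read together.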

One  can  see  from  Table  1  that  the  number  of 
correct  road  hypotheses  is  consistently  high  (70- 
80%),  though  the  amount  of  ground  truth  cov¬ 
ered  is  relatively  low  (20-30%).  This  is  because 
the  system  fails  to  track  about  70%  of  the  in¬ 
put  starting  points,  though  the  points  that  it 
tracks  are  accurately  tracked.  The  branching 
factor  measure  is  not  as  interesting  here,  as  all 
the  road  starting  points  are  considered  correct 
since they were manually chosen. However, it is
worth noting that the branching factor remains
low, implying that mistracks and overshoots
are few. Future work will extend our
performance measures to include performance
analysis in 3D.

3.3  Toward  a  model  for  automated 
object-space  road  extraction 

Recently,  we  have  focused  on  improving  the  per¬ 
formance  of  our  existing  tracking  system.  As 
seen  in  some  of  the  performance  numbers  pre¬ 
sented  here,  it  implements  a  conservative  ap¬ 
proach  to  extending  road  starting  points.  From 
an  implementation  standpoint,  it  is  difficult  to 
use  this  system  to  explore  new  tracker  control 
scenarios.  Additionally,  there  are  many  inter¬ 
nal  parameters  that  are  pixel  based,  and  it  is 
difficult  to  change  parameter  settings  based  on 
“first principles”. For these reasons and others,
we were motivated to develop a new tracker
design that would resolve most or all of these
problems.

Using  our  experiences  with  the  current  tracking 
system,  coupled  with  our  ideas  about  tracking 
scenarios,  we  have  developed  several  principles 
that  we’ve  used  to  guide  our  new  design.  Track¬ 
ers  should  be  composable  objects  with  well- 
defined  interfaces  so  that  new  tracker  control 
structures  can  easily  be  implemented.  If  track¬ 
ers  wish  to  share  information,  they  must  do  so 
through  a  parent  tracker,  and  that  tracker  must 
access  and  provide  information  only  through  the 
available interfaces. Coordinate systems are
explicit, so that generating object-space features
is straightforward. Finally, the implementation
of the model should be efficient, so that it is
scalable, and so that it remains useful in an
interactive context.

We represent a tracker as a generic object within
the system. Each type of tracker (e.g.,
cooperative tracker, surface tracker, edge tracker,
verification tracker) is implemented as a specific
instantiation of the generic tracker object.
Trackers are not manipulated directly, but are created
and modified as states in a finite state machine.
Each  tracker  operation  takes  a  state  as  input 
and  returns  a  new  state  as  output.  This  is  a 
modular  interface,  and  permits  us  to  easily  ex¬ 
trapolate  forward  or  move  backward  through  a 
tracker’s  history. 
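
The state-in, state-out interface can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the actual implementation: the names `TrackerState`, `step_forward` and `TrackerHistory`, and the fields stored in a state, are hypothetical. The point it shows is that when every operation returns a new immutable state, keeping the sequence of states is enough to move backward through a tracker's history.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # object-space (easting, northing); assumed layout


@dataclass(frozen=True)
class TrackerState:
    """One immutable state of a tracker (fields are illustrative)."""
    position: Point
    heading: Point  # unit direction vector
    score: float


# A tracker operation maps a state to a new state and never mutates its input.
Operation = Callable[[TrackerState], TrackerState]


def step_forward(step_length: float) -> Operation:
    """Build an operation that extrapolates along the current heading."""
    def operation(state: TrackerState) -> TrackerState:
        x, y = state.position
        dx, dy = state.heading
        new_position = (x + step_length * dx, y + step_length * dy)
        return TrackerState(new_position, state.heading, state.score)
    return operation


class TrackerHistory:
    """Keeps every state so the tracker can be rewound."""

    def __init__(self, initial: TrackerState) -> None:
        self._states: List[TrackerState] = [initial]

    @property
    def current(self) -> TrackerState:
        return self._states[-1]

    def apply(self, operation: Operation) -> TrackerState:
        self._states.append(operation(self.current))
        return self.current

    def back(self, n: int = 1) -> TrackerState:
        """Discard the last n states, never dropping the initial one."""
        keep = max(1, len(self._states) - n)
        del self._states[keep:]
        return self.current


history = TrackerHistory(TrackerState((0.0, 0.0), (1.0, 0.0), 1.0))
history.apply(step_forward(2.0))
history.apply(step_forward(2.0))
```

Calling `history.back()` here returns the tracker to the state after the first step without any undo logic inside the operations themselves.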

The  model  and  operations  were  designed  with 
specific  applications  in  mind.  A  list  of  possible 
tracking  scenarios  includes: 

•  Cooperative  tracker  —  A  composition  of  two 
or  more  trackers.  There  are  two  versions  of 
this: 

—  Synchronous  —  Subordinate  trackers  ad¬ 
vance  in  lock-step. 

—  Asynchronous  —  Subordinate  trackers  are 
permitted  to  advance  independently. 

•  Multi-Image  tracker  (sequential)  —  This 
tracker  tracks  in  a  single  image  and,  when 
necessary,  switches  images  in  an  attempt  to 
continue  tracking. 

•  Multi-Image  tracker  (parallel)  —  This  tracker 
attempts  to  extract  the  same  feature  from 
2  or  more  images  simultaneously  (a  stereo 
tracker) . 

•  Directed  (guided)  tracker  -  Assumes  that 
“hints”  have  been  given  about  where  to  find 
the  road,  such  as  a  rough  sketch  or  some  pre¬ 
existing  data.  It  uses  this  information  to  ex¬ 
tract  a  more  detailed/accurate  road  feature. 

•  Multi-Resolution  tracker  —  A  hierarchy  of 
corresponded  data  has  been  provided  that 
will  be  used  to  extract  the  road  features.  We 
believe  this  to  be  a  special  case  of  the  guided 
and/or  multi-image  (parallel)  trackers. 

• Coarse-Grained tracker — A tracker that
takes “large” steps to extract the next set of
road points. One example of such a tracker
would be one that correlates some number of
sequences of profiles in a single “step”.
Another would be a tracker that attempts to find
parallel lines on each step.


•  Verification  tracker  —  A  specific  path  is  pro¬ 
vided  along  which  the  tracker  is  required  to 
walk.  Some  estimate  of  the  goodness  of  the 
path  is  to  be  computed. 


Each of these control scenarios presents unique
implementation challenges. All currently fit
cleanly within our composable tracker design
model.

Thus  far,  we  have  used  our  composable  tracker 
model  to  implement  a  simple  cooperative  road 
tracker.  The  implementation  contains  three 
trackers:  a  typical  correlation-based  tracker, 
a  correlation-based  tracker  using  probabilistic 
scoring,  and  a  cooperative  tracker  that  controls 
the  correlation  trackers.  The  subordinate  track¬ 
ers  operate  independently,  with  the  cooperative 
tracker  synchronizing  their  steps. 
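
A synchronous cooperative tracker of this kind might be sketched as follows. The text does not say how the cooperative tracker combines its subordinates, so the rule used here (advance every subordinate once, then hand the best-scoring position back to all of them through the parent) is purely an assumption, and the two step functions are toy stand-ins for the correlation-based trackers.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class State:
    position: Tuple[float, float]  # object-space coordinates (assumed layout)
    score: float                   # match confidence, higher is better (assumed)


Step = Callable[[State], State]


def cooperative_step(states: Sequence[State], steps: Sequence[Step]) -> List[State]:
    """Advance every subordinate once (lock-step), then share the best
    position with all of them. The sharing rule is an assumption."""
    proposals = [step(state) for step, state in zip(steps, states)]
    best = max(proposals, key=lambda s: s.score)
    return [replace(p, position=best.position) for p in proposals]


def correlation_step(state: State) -> State:
    """Toy stand-in for a correlation-based tracker."""
    x, y = state.position
    return State((x + 1.0, y), 0.9)


def probabilistic_step(state: State) -> State:
    """Toy stand-in for a tracker using probabilistic scoring."""
    x, y = state.position
    return State((x + 1.0, y + 0.5), 0.4)


def run(start: Tuple[float, float], steps: Sequence[Step], n_steps: int) -> List[State]:
    """Run the cooperative tracker for a fixed number of synchronized steps."""
    states = [State(start, 0.0) for _ in steps]
    for _ in range(n_steps):
        states = cooperative_step(states, steps)
    return states


final_states = run((0.0, 0.0), [correlation_step, probabilistic_step], 3)
```

The subordinates never see each other: all exchange of information goes through `cooperative_step`, in keeping with the design principle that trackers share information only through a parent tracker.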

Given  manually  generated  road  starting  points, 
results  from  this  tracker  are  shown  in  Figure  6. 
Visually,  one  can  see  that  the  major  road  ar¬ 
eas  are  almost  completely  tracked.  There  are 
fewer  breaks  in  the  longer  road  segments,  and 
there  are  fewer  overshoots  at  the  ends  of  road 
segments.  We  continue  to  have  problems  in  the 
suburban  housing  areas,  and  this  is  largely  due 
to  dense,  overhanging  vegetation. 

The  performance  analysis  numbers  are  summa¬ 
rized  in  Table  2.  Comparing  these  numbers  to 
those  in  Table  1,  we  can  see  that  we  have  im¬ 
proved  in  almost  every  category.  We  have  low¬ 
ered  the  branching  factor,  quantitatively  show¬ 
ing  that  there  are  fewer  overshoots  and  mis- 
tracks.  The  quality  and  %  correct  numbers  are 
slightly  higher.  Notably,  the  amount  of  ground 
truth  explained  (covered)  has  more  than  dou¬ 
bled.  This  is  because  the  composable  tracker 
does  a  better  job  of  robustly  tracking  the  given 
roads. The % redundancy is the one category
where results have worsened. We expected this
result because our current system includes
overlap detection, and the new composable tracking
system does not yet have this capability. Our
current efforts are focused on completing the
implementation of the composable tracker
framework, as well as on further improving its
performance.

Figure 6: An orthographic view of Fort Hood, overlayed with the automated results generated
by our object-space composable tracker. Manual starting points were used so that the
performance of the tracker could be measured independent of detection errors.

Table 1: Performance evaluation measures generated using the current image-space automated
road tracking system.

Image    Branching  Quality  % GT       % Correct   % Redundant
         Factor     Factor   Explained  Hypotheses  Pixels

FHN711   0.32       0.54     23.61      75.93       15.77
FHN713   --         0.48     37.91      64.55       27.61
FHN715   0.50       0.51     26.55      66.67       28.57
FHN717   --         0.60     34.87      74.34       27.55
FHN719   0.32       0.53     27.83      75.59       20.19

Table 2: Performance evaluation measures for the new object-space based composable tracker.

Image    Branching  Quality  % GT       % Correct   % Redundant
         Factor     Factor   Explained  Hypotheses  Pixels

FHN711   0.20       0.60     58.79      83.08       46.77
FHN713   0.25       0.58     50.86      79.82       44.95
FHN715   0.28       0.62     72.82      77.88       68.14
FHN717   0.24       0.68     70.01      80.45       57.73
FHN719   0.26       0.57     64.36      79.65       62.79

4  Experiments  with  Data  Fusion 

A  common  theme  throughout  MAPSLab  re¬ 
search  has  been  the  belief  that  no  individ¬ 
ual  computer  vision  technique  can  reliably  pro¬ 
vide  a  complete  scene  reconstruction.  Thus,  to 
achieve  good  performance,  we  need  to  gather 
a  variety  of  information,  extracted  by  various 
processes  from  multiple  imagery  of  the  area  of 
interest.  Then  we  need  to  synthesize  this  dis¬ 
parate  information  into  a  consistent  model. 

In  three-dimensional  scene  analysis,  the  goal  is 
to  generate  an  interpretation  of  the  scene  that 
is  as  close  as  possible  to  the  actual  scene  im¬ 
aged.  Such  an  interpretation  can  include  the 
delineations  and  heights  of  buildings,  the  cen¬ 
terline  and  width  of  roads  in  a  transportation 
network,  a  digital  elevation  model,  and  the  seg¬ 
mentation  and  classification  of  the  scene  by  sur¬ 
face  material. 

The key issue is the integration of many
different sources of partial information. This
problem appears under different guises: for
example, given a set of different scene descriptions
generated from a single image using a variety of
techniques, how does one intelligently combine
such partial information? The introduction of
additional sensor types, temporal imagery, and
multiple-look imagery creates dimensions along
which information fusion must be performed;
as such, the complexity of the problem can
increase. In some cases, increased amounts of
data provide improved information. This may
not necessarily follow, however. Complex
systems having different sources of error may not
reinforce correct partial interpretations nor
refute incorrect ones. In addition, the results from
different sets of imagery need to be represented
in a common coordinate system in order to
perform any fusion whatsoever.

4.1  Sources  of  information 

The  purpose  of  our  fusion  research  is  to  deter¬ 
mine  how  best  to  integrate  disparate  sources 
of  information  and  how  best  to  use  the  com¬ 
bined  data  to  facilitate  three-dimensional  scene 
analysis. We are currently working with four
principal sources of information abstracted from
near-nadir and oblique imagery over the area of
interest. They are building hypotheses, road
networks, elevation models generated from stereo
matching, and surface material classifications
generated from multispectral and hyperspectral
imagery.

As described in Section 2.1, the PIVOT system
generates building hypotheses from the
analysis of single or “monocular” images. The key
advantages of this data are that the edges of
objects are well localized and the vertical edges
in oblique views yield good relative height
estimates. However, with near-nadir imagery, poor
height estimates may be generated due to lack
of verticals. Also, edge breaks cause partial or
total building loss (false negatives), and trees,
cars and grassy quads can cause false positives.

As described in Section 3, the output of
multiple road trackers is combined to generate
the road segments. The road
segments are then linked to form intersections
and full road networks. This road network
segments a built-up area into logical areas for
analysis, which can be used to control the
application of other processes. However, the common
occurrence of trees along the sides of the road
can make tracking difficult in some areas. Here,
information about the occurrence of high
vegetation near a suspected roadway (e.g., obtained
from stereo and surface classification) could be
used to provide guidance.

The third source of information is the stereo
analysis of the areas where there exists stereo
coverage. We have semi-automated the
USATEC Digital Photogrammetric Compilation
Package (DPCP) [Norvelle, 1992; Norvelle,
1981] to generate multiple stereo interpretations
of built-up areas which can be projected into
a common coordinate system and combined to
form a mosaicked surface. The good points
of the stereo process are that it can generate
good height estimates from nadir imagery and,
for continuous terrain, the matching algorithm
tracks the surface very well. The difficulty with
the stereo is that errors often occur at depth
discontinuities, and that the edges of objects are
smoothed by the area-based process. Finally,
the elevation estimate degrades with increasing
obliquity.

The fourth information source is the surface
material classification of the area (see Section 5). We
use both automated and supervised
classification of high-resolution hyperspectral data
collected by the HYDICE sensor to segment and
classify the surface according to material. This
information is useful because common surfaces
generally share the same surface materials, and
the material classification of pixels adjacent to
strong edges can be used to guide building
hypotheses. However, the resolution of the data
is generally less than that of the panchromatic
images, and the airborne hyperspectral sensor
results are difficult to geoposition.

There are alternative ways for organizing the
three-dimensional scene reconstruction threads
into a combined processing approach. The
basic division is either into a bottom-up
(data-directed) approach, where the results from the
different methods are merged together; or a
top-down (knowledge-directed) approach, where the
partial results are used to guide or select from
other approaches. In the next two sections, we
present our initial efforts in combining the
partial results.


4.2  Bottom-up  fusion 

In this section we outline several ways that the
results of the feature extraction/classification
may be combined. It is difficult to imagine
how to perform fusion without an object-space
framework within which to correlate
information. Each of the following examples requires
the choice of a common geospatial coordinate
system; in our case we use the Universal
Transverse Mercator system, since it is used by many
data sources and products. But the key notion
is that image-based fusion, requiring projection
of information into a particular image
coordinate system, is not flexible enough to support
fusion from a variety of sensor or cartographic
sources.


4.2.1  Stereo  fusion 


The stereo process may be applied to multiple
stereo pairs of the same area. This process may
be run both left-to-right and right-to-left on
each pair of imagery to give a possible
$2\binom{n}{2}$ stereo results for n images. In
practice, usually due to scale differences, some
combinations are not good candidates for stereo
matching. The resultant stereo results can be
combined to reduce the blunders and reinforce
actual elevation values. The stereo results shown
in Figure 7 represent the averaged values from six
stereo matches after outliers have been removed.
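
A minimal Python sketch of this combination step follows. The count of candidate stereo results comes directly from the text (each of the C(n, 2) image pairs matched in both directions). The outlier rule is not specified in the text, so the one used here, rejecting estimates more than a fixed tolerance from the median, is an assumption, as are the function names and the 2-meter default tolerance.

```python
from math import comb
from statistics import mean, median_low


def num_stereo_results(n_images):
    """Each of the C(n, 2) pairs can be matched left-to-right and right-to-left."""
    return 2 * comb(n_images, 2)


def fuse_elevations(estimates, max_dev=2.0):
    """Average the elevation estimates (meters) for one ground cell after
    dropping outliers.

    `estimates` may contain None where a stereo match failed. An estimate is
    treated as an outlier if it lies more than `max_dev` meters from the
    median; this rule is an assumption, not the paper's stated method.
    """
    valid = [e for e in estimates if e is not None]
    if not valid:
        return None
    # median_low returns an actual data value, so at least one inlier remains.
    center = median_low(valid)
    inliers = [e for e in valid if abs(e - center) <= max_dev]
    return mean(inliers)


# Six stereo results for one cell; the 25.0 m value plays the role of a blunder.
cell = [10.1, 9.9, 10.0, 10.2, 25.0, 9.8]
fused = fuse_elevations(cell)
```

With these inputs the blunder is discarded and the remaining five estimates average to about 10 meters; three images give `num_stereo_results(3) == 6` candidate results.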


Building, road, individual tree, and tree canopy
hypotheses may be used in combination with the
stereo to remove incorrect stereo matches or
to indicate areas where the stereo matching
may be skipped due to obstruction. In
addition, these areas may be marked for removal
in support of the generation of a digital
elevation model of the area. Likewise, buildings
and roads may also be used as constraints in
the stereo processing. The largest single
problem in the stereo processing is the errors due to
depth discontinuity and the associated occluded
areas. Importing constraints to the stereo
process at these locations should improve the
resultant stereo matching.

(a) All building hypotheses. (b) Verified building hypotheses.

Figure 7: Using stereo elevation to verify building hypotheses.

4.2.2  Stereo  and  building  hypotheses 

Stereo matching and monocular building
extraction appear to vary inversely to each other
in their ability to estimate height from images
having different obliquity.
By combining hypotheses from the two
methods, the confidence of each height estimate may
be weighted according to the obliquity of the
original imagery.

Stereo  may  be  adjusted  for  the  smoothing  due 
to  the  correlation  mask  and  used  to  verify  or  re¬ 
ject  building  hypotheses  based  on  the  difference 
of  the  inside  and  the  surrounding  height. 

Figure  7a  shows  building  hypotheses  generated 
by  PIVOT  registered  in  object-space  (UTM) 
with  the  elevation  generated  by  the  stereo  pro¬ 
cessing. 

Here those hypotheses that enclose raised areas
and are surrounded by a surface at a lower
elevation are considered likely candidates for
buildings. In this case the criterion selected is that
the interior is on average at least two
meters higher than the surrounding ground after
correction for the smoothing due to the stereo
process. The verified hypotheses are shown in
Figure 7b.
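
The verification criterion can be written down as a short Python sketch. Only the two-meter threshold and the interior-versus-surround comparison come from the text; the function name, the way the elevation samples are gathered, and the treatment of the smoothing correction as a simple additive offset are assumptions made for illustration.

```python
from statistics import mean


def verify_building(interior_elev, surround_elev,
                    smoothing_correction=0.0, min_rise=2.0):
    """Accept a building hypothesis if its interior is, on average, at least
    `min_rise` meters above the surrounding ground.

    interior_elev: stereo elevations (m) sampled inside the hypothesis
    surround_elev: stereo elevations (m) sampled in a ring around it
    smoothing_correction: meters added to compensate for the smoothing of
        the correlation mask (additive form is an assumption)
    """
    rise = mean(interior_elev) - mean(surround_elev) + smoothing_correction
    return rise >= min_rise


# A raised roof about 5.5 m above its surroundings is accepted.
accepted = verify_building([105.0, 106.0, 105.5], [100.0, 100.5, 99.5])

# A hypothesis around a grassy quad, about 0.25 m above its surroundings,
# is rejected.
rejected = verify_building([100.25, 100.5, 100.0], [100.0, 100.25, 99.75])
```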

4.2.3  Stereo  and  multispectral  surface 
material  classification 

Figure 8 shows the initial overlay and fusion
of the surface material classification with the
height above the ground generated by the stereo
processing. Here the value component of each
material class is represented by the elevation.
We can isolate raised areas representing trees
and tree canopies as well as provide
attribution to building features such as the asphalt and
sheet metal roofs of the ‘L’-shaped buildings to
the left and right respectively. Figure 8b shows
the asphalt component extracted from the
combined elevation/material class image shown in
Figure 8a. This shows clearly the raised asphalt
(lighter) areas and the ground-level (darker)
occurrences.

(a) Surface material plus elevation.

(b) Asphalt surfaces, showing building roofs, parking lots and roadways.

Figure 8: Combining surface material classification with stereo elevation to identify materials by
type and height.

(a) Original stereo elevation. (b) Road network from IdLWoof. (c) Updated elevation from stereo.

Figure 9: Combining road tracks with stereo elevation to remove blunders due to moving vehicles.


4.2.4  Stereo  and  road  networks 

When roads are aligned with the epipolars of an
image pair, then cars moving along the roads
may be matched and their motion mistaken for
elevation, with the cars moving in one direction
appearing at a higher elevation and those in the
opposite direction appearing lower. The road
network can be superimposed over the stereo
and artifacts due to traffic along the road may
be removed, as shown in Figure 9.


4.3 Top-down fusion

The  other  fundamental  approach  to  fusing  data 
from  different  feature  classification  methods  is 
to  use  one  to  focus  or  guide  another.  Here 
knowledge  obtained  about  the  scene  may  be 
projected  to  the  image  space  or  local  coordinate 
system  of  the  process  to  be  supported. 

For example, Figure 10 shows the areas which
appear to be elevated relative to their local
surroundings after analyzing the stereo results.
Here, the stereo “blobs” have been projected
back into the original image of the area where
they may be used to either limit the search space
by providing focused areas of interest or to
suggest early grouping of edges in the
formation of building hypotheses.

This should help to improve performance both
in terms of extraction quality (fewer false
positives) and processing time (fewer features to
consider).


5  Analysis  of  Hyperspectral  Imagery: 

HYDICE 

The low-to-moderate spatial resolution of
multispectral data generally available has limited its
usefulness to generating coarse descriptions of
surface materials in fairly large, homogeneous
areas. We believe that the high spatial
resolution hyperspectral data experimentally
available will allow us to generate surface material
maps with dramatically better detail and
fidelity, especially in urban areas. In addition,
the  higher  spatial  resolution  of  the  surface  ma¬ 
terial  map  will  make  it  applicable  to  carto¬ 
graphic  feature  extraction — we  will  be  able  to 
see  and  classify  parts  of  individual  buildings  or 
roads,  thereby  greatly  improving  our  hypothe¬ 
sis  verification  capabilities.  We  have  done  pre¬ 
liminary  work  in  using  multispectral  data  with 
moderate  spatial  resolution  in  conjunction  with 
high  resolution  panchromatic  imagery;  this  has 
raised  important  issues  in  multisensor  registra¬ 
tion,  cross-sensor  information  fusion,  spatial- 
temporal  differences,  and  in  new  techniques  for 
automated  material  classification,  as  well  as  ver¬ 
ifying  the  power  of  the  fusion  of  such  data. 

The  Naval  Research  Laboratory  (NRL)  Hyper¬ 
spectral  Digital  Imagery  Collection  Experiment 
(HYDICE)  sensor  system  is  a  210  channel  air¬ 
borne  scanner  capable  of  providing  the  high  spa¬ 
tial  resolution  imagery  that  we  believe  is  re¬ 
quired  to  support  detailed  spatial  and  spectral 
analysis.  The  sensor  is  mounted  on  a  CV-580 
aircraft;  depending  on  aircraft  altitude  above 
ground  level,  the  ground  sample  distance  (GSD) 
varies from 0.75 to 3.75 meters. The HYDICE
sensor is 320 pixels wide, giving a ground swath
of 240 meters up to approximately a kilometer.
Its spectral range extends from the visible to the
short wave infrared (400 to 2500 nanometers)
region, divided into 210 channels with
nominal 10 nanometer bandwidths, varying from 7.6
to 14.9 nanometers, depending on channel
location in the electromagnetic spectrum. The
spectral bandpasses of three traditional
multispectral imaging systems (Daedalus Airborne
Thematic Mapper (ATM), Landsat Thematic
Mapper (TM) and SPOT High Resolution Visible
(HRV) Imaging Instrument) partially overlap
the spectral range of the HYDICE sensor
system.

Figure 10: Stereo elevation may be used to localize searches.

Figure 11: Process model for HYDICE radiance imagery.

The  HYDICE  sensor  is,  geometrically,  a  linear 
pushbroom  sensor,  which  means  that  the  CCD 
array  is  oriented  perpendicular  to  the  direction 
of  flight  and  the  2D  image  is  formed  by  platform 
motion  (Figure  12).  Physically,  the  sensor  is  an 
area  array,  with  each  array  line  perpendicular  to 
the  direction  of  flight  recording  the  intensity  in 
a different spectral band. Each image line
consists of 320 pixels with an instantaneous field of
view of 0.5 milliradians per pixel, giving a total field
of view of approximately 9 degrees. The first 7
and  last  5  pixels  in  an  image  line  are  outside  the 
optical  path  and  contain  no  image  data. 
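
The sensor figures quoted above are easy to check with a few lines of Python. The pixel count, per-pixel field of view, dead-pixel counts and GSD range are taken from the text; the function names are hypothetical, and the flying-height estimate assumes a nadir view and the small-angle approximation (GSD = IFOV x height), which the text does not state.

```python
import math

PIXELS_PER_LINE = 320   # from the text
IFOV_RAD = 0.5e-3       # 0.5 milliradians per pixel, from the text
DEAD_LEADING = 7        # first pixels outside the optical path
DEAD_TRAILING = 5       # last pixels outside the optical path


def total_fov_deg():
    """Total field of view across the line, in degrees."""
    return math.degrees(PIXELS_PER_LINE * IFOV_RAD)


def swath_m(gsd_m):
    """Ground swath for the full 320-pixel line at a given GSD."""
    return PIXELS_PER_LINE * gsd_m


def usable_pixels():
    """Pixels per line that actually contain image data."""
    return PIXELS_PER_LINE - DEAD_LEADING - DEAD_TRAILING


def height_for_gsd_m(gsd_m):
    """Approximate height above ground for a given GSD.

    Assumes nadir viewing and the small-angle relation GSD = IFOV * height;
    this relation is an assumption, not a figure from the text.
    """
    return gsd_m / IFOV_RAD


fov = total_fov_deg()                        # about 9.17 degrees
swaths = (swath_m(0.75), swath_m(3.75))      # 240 m and 1200 m
```

The field of view works out to about 9.2 degrees and the swath to 240 meters at the finest GSD, matching the text; at the coarsest GSD the full line spans 1200 meters, and 308 of the 320 pixels carry image data.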

During October 24-27, 1995, we organized a
hyperspectral data acquisition flown over Ft.
Hood, Texas using the Naval Research
Laboratory’s (NRL) HYDICE sensor system to
support research under our RCVW program.
Additionally, natural color film was shot during
the HYDICE collection flights and ground truth
spectral measurements were acquired of surface
materials to be imaged by the HYDICE
sensor. A complete description of the data
acquisition event as well as additional technical
details are available [McKeown et al., 1996b;
McKeown et al., 1996a].

Figure 11 gives an overview of how we plan to
process HYDICE imagery within the APGD
research program. Two major components,
radiometric and geo-positioning, must be addressed in
order to effectively utilize the HYDICE imagery
for information fusion in cartographic feature
extraction. From the radiometric perspective,
atmospheric conditions affect spectral scene
illumination and spectral scene radiance reaching
the HYDICE sensor. In order to compare
surface material properties between flightlines, an
atmospheric correction will be applied
to convert HYDICE radiance imagery to
apparent reflectance. This image conversion provides
a framework to compare surface material
properties across flightlines and to utilize spectral
field measurements of surface materials for
spectral analysis.

However, in order to make effective use of this
information, highly accurate geo-positioning is
critical for fusing surface material topologic
regions with information derived from our road
and building feature extraction systems. Here
the goals diverge somewhat from traditional
remote sensing, where precise registration is much
less of an issue as aggregate level descriptions
of surface coverage or vegetation are acquired
at much coarser spatial resolutions.

In Section 5.1 we describe research issues as
well as pragmatic constraints in scene
registration for the HYDICE sensor. In Section 5.2 we
describe the process for radiometric calibration
and preliminary material classification
experiments performed using HYDICE imagery.

5.1  Registration  of  HYDICE  imagery 

Precise  image  registration  is  an  absolute  re¬ 
quirement  for  the  fusion  of  information  from 
multiple  images.  However,  registration  of  HY¬ 
DICE  imagery  presents  special  problems,  due 
to  its  linear  pushbroom  imaging  geometry  and 
dynamic image formation. The HYDICE
registration problem consists of two parts:
modeling the perspective geometry of each single line,
and modeling the platform motion along the
flight line. Our approach to this problem uses
navigation  sensor  information,  frame  imagery  of 
the  area,  and  geometric  information  within  the 
scene  to  obtain  the  most  rigorous  and  accurate 
registration  possible. 

The  HYDICE  sensor  is  mounted  in  a  stabilized 
platform  in  a  Convair  580  aircraft,  along  with 
other  navigational  and  imaging  sensors.  The 
navigation  equipment  includes  a  GPS  (Global 
Positioning  System)  receiver,  capable  of  differ¬ 
ential  operation,  and  an  inertial  navigation  sys¬ 
tem  (INS).  The  readings  from  all  these  sensors 
are  recorded  on  the  data  tape  in  real  time. 

Although a fairly complete suite of navigation
sensors was available, this does not, in general,
solve the registration problem. Even in ideal
cases,  best  accuracies  of  several  to  10  meters 
on  the  ground  are  obtainable  for  the  imag¬ 
ing  geometries  specified  during  our  acquisition. 
When  performing  information  fusion  with  im¬ 
agery  having  a  GSD  of  0.3  meters,  this  accuracy 
is  not  adequate  by  itself  and  must  be  improved 
by  a  simultaneous  solution.  We  were  also  lim¬ 
ited  by  the  lack  of  integration  of  the  HYDICE 
and  navigation  sensors.  HYDICE  is  an  exper¬ 
imental  system,  with  its  main  thrust  being  the 
development  of  high  resolution  spectral  capabil¬ 
ities,  so  system  integration  to  improve  the  po¬ 
sitioning  capabilities  of  the  sensor  has  not  been 
a  priority.  This  greatly  complicates  the  solu¬ 
tion  for  the  HYDICE  imagery,  since  the  weak 
imaging  geometry  of  linear  pushbroom  sensors 
makes  any  navigation  information  very  useful. 


Figure 12: Linear pushbroom geometry, with exterior
orientation Xc(x), Yc(x), Zc(x), ω(x), φ(x), κ(x).

We were also adversely impacted by specific
acquisition conditions: due to time constraints,
the Ft. Hood flights were made during turbulent
atmospheric conditions, which had adverse
effects on the image geometry.

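5.1.1 The sensor and platform model

A linear pushbroom sensor can be thought of as
a frame sensor with only one line in the x, or
flight line, direction (Figure 12). The collinearity
equations [McGlone, 1996], modified for use
with linear pushbroom imagery, are:

$$
\begin{aligned}
0 &= -f\,\frac{U}{W}, \\
y - y_0 &= -f\,\frac{V}{W},
\end{aligned}
\qquad
\begin{bmatrix} U \\ V \\ W \end{bmatrix}
= M
\begin{bmatrix} X_p - X_c \\ Y_p - Y_c \\ Z_p - Z_c \end{bmatrix}
\tag{1}
$$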

where the x coordinate is 0, y is the image coordinate
and y0 is the principal point along the
sensor, f is the focal length, and Xp, Yp, Zp are
Cartesian world coordinates of the point. The
position parameters, Xc, Yc, Zc, and the orientation
parameters ω, φ, κ (which determine the
3×3 orientation matrix M) are functions of time,
or equivalently, line number. To model this, the
value of each parameter (Xc, Yc, Zc, ω, φ, κ) at a
particular line is written as a polynomial function
of line number x. The block adjustment solution
then actually determines the polynomial
coefficients, instead of the parameters themselves.
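
The model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rotation order (κ·φ·ω), the coefficient layout, and the names `rotation_matrix`, `orientation_at_line`, and `project` are hypothetical and are not taken from the photogrammetry package used in this work. The collinearity form used is the standard one, with image coordinates equal to -f times the ratio of rotated object-space offsets.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def rotation_matrix(omega, phi, kappa):
    """Object-to-image rotation M = R_kappa @ R_phi @ R_omega (assumed convention)."""
    co, so = np.cos(omega), np.sin(omega)
    cp, sp = np.cos(phi), np.sin(phi)
    ck, sk = np.cos(kappa), np.sin(kappa)
    r_omega = np.array([[1, 0, 0], [0, co, so], [0, -so, co]])
    r_phi = np.array([[cp, 0, -sp], [0, 1, 0], [sp, 0, cp]])
    r_kappa = np.array([[ck, sk, 0], [-sk, ck, 0], [0, 0, 1]])
    return r_kappa @ r_phi @ r_omega


def orientation_at_line(coeffs, x):
    """Evaluate each parameter's polynomial (increasing-order coefficients) at line x."""
    return {name: P.polyval(x, c) for name, c in coeffs.items()}


def project(point, coeffs, x, f, y0=0.0):
    """Return (x_residual, y) for a ground point assumed to be imaged on line x.

    x_residual is the left-hand side of the first collinearity equation and
    is zero when the point really lies in the viewing plane of line x.
    """
    p = orientation_at_line(coeffs, x)
    m = rotation_matrix(p["omega"], p["phi"], p["kappa"])
    offset = np.asarray(point, float) - np.array([p["Xc"], p["Yc"], p["Zc"]])
    u, v, w = m @ offset
    return -f * u / w, y0 - f * v / w
```

In an adjustment, the unknowns would be the entries of `coeffs` rather than the six parameters at any single line, which is the point made in the text.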

Not all of the six orientation parameters can
be recovered in a resection solution, due to the
sensor geometry. The linear sensor geometry
means that the φ (pitch) angle will be highly
correlated with position along the flight line,
while the narrow field of view and lack of terrain
relief mean that the ω (roll) angle will be
correlated with the cross-strip position. Without
external information, such as angles or positions
from navigation sensors, the ω and φ parameters
must be held to 0 in the adjustment.

To avoid having to solve for high-order polynomials,
the flight line is divided into sections,
with each section having its own set of polynomials.
Continuity constraints are written at
the section boundaries to ensure that the orientation
elements are identical at the boundaries
and that calculated ground positions are consistent
across the boundary.
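
As a hedged illustration of the sectioning idea, the sketch below fits one polynomial per section to samples of a single parameter and enforces value continuity at the boundary with a Lagrange multiplier. It covers only one parameter and two sections, and it does not include the ground-position consistency constraint mentioned above; the function name `fit_two_sections` and the constrained least-squares formulation are assumptions made for the example, not the formulation used in the actual block adjustment.

```python
import numpy as np


def fit_two_sections(x, values, boundary, degree=1):
    """Fit one polynomial per section, constrained to agree at `boundary`.

    Returns (left_coeffs, right_coeffs), both in increasing order.
    """
    n = degree + 1
    left = x <= boundary
    # Design matrix: the first n columns act on the left section only,
    # the last n columns on the right section only.
    a = np.zeros((x.size, 2 * n))
    a[left, :n] = np.vander(x[left], n, increasing=True)
    a[~left, n:] = np.vander(x[~left], n, increasing=True)
    # Continuity constraint: p_left(boundary) - p_right(boundary) = 0.
    row = np.vander(np.array([boundary], float), n, increasing=True)[0]
    c = np.concatenate([row, -row])[None, :]
    # Solve the equality-constrained least-squares problem via its KKT system.
    kkt = np.block([[2 * a.T @ a, c.T], [c, np.zeros((1, 1))]])
    rhs = np.concatenate([2 * a.T @ values, [0.0]])
    solution = np.linalg.solve(kkt, rhs)
    return solution[:n], solution[n:2 * n]
```

Using low-degree polynomials per section keeps the number of unknowns per section small, at the cost of one constraint row per parameter per boundary.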

5.1.2  Block  Adjustment  of  HYDICE 

Standard  block  adjustments  use  control  and  tie 
points  to  establish  image  positions  and  establish 
the  relationships  between  images.  While  our 
block  adjustment  also  uses  discrete  points,  we 
have  extended  the  solution  to  include  object- 
space  geometric  constraints. 

We  make  extensive  use  of  straight  line  con¬ 
straints,  since  we  have  a  large  number  of  object- 
space  straight  lines  available  over  the  motor 
pool  area  and  these  provide  a  large  amount  of 
geometric  strength  in  determining  the  orienta¬ 
tion  parameters.  Measuring  a  straight  line  ex¬ 
tending  across  a  number  of  image  lines  effec¬ 
tively  provides  a  tie  point  in  every  image  line 
in which the line appears. We also utilize right
angle constraints, which aid in determining the
κ (yaw) angle of the sensor.
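
The two object-space constraints can be expressed as residuals that vanish when the constraint is satisfied. The sketch below shows one plausible form of each; the function names and the particular residual definitions (perpendicular distance and cosine of the corner angle) are assumptions for illustration, and the text does not specify how the constraints are actually written in the adjustment.

```python
import numpy as np


def straight_line_residual(p, a, b):
    """Perpendicular distance of ground point p from the 3-D line through a and b."""
    p, a, b = (np.asarray(v, float) for v in (p, a, b))
    direction = b - a
    return float(np.linalg.norm(np.cross(p - a, direction)) / np.linalg.norm(direction))


def right_angle_residual(a, corner, b):
    """Cosine of the angle at `corner`; zero when the two arms are perpendicular."""
    u = np.asarray(a, float) - np.asarray(corner, float)
    v = np.asarray(b, float) - np.asarray(corner, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Each ground point measured along a straight feature contributes one `straight_line_residual` term, which is how a single line measured across many image lines adds strength to the solution.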

Three  sets  of  imagery  are  used  in  this  adjust¬ 
ment: 

•  The  HYDICE  imagery,  collected  in  nine  side¬ 
lapping  flight  lines.  The  imagery  was  flown 
at  an  altitude  of  approximately  4400  meters, 
giving  a  ground  sample  distance  (GSD)  of  2 
meters. 

• Color frame imagery, collected on the HYDICE
flights with a KS-87 frame reconnaissance
camera with a six inch focal length and
a five inch format. Since this is a reconnaissance
camera instead of a mapping camera,
it has not been calibrated. The imagery was
scanned at a 1 meter GSD.

•  The  RADIUS  Ft.  Hood  imagery.  This  con¬ 
sists  of  about  40  nadir  and  oblique  images, 
taken  with  a  frame  mapping  camera  and 
scanned  at  a  GSD  of  0.3  meters  for  the  ver¬ 
tical  images.  These  images  have  been  previ¬ 
ously  block  adjusted  using  surveyed  ground 
control,  and  provide  the  basic  geometric 
strength  for  the  adjustment. 

The  control  points  for  the  adjustment  were  orig¬ 
inally  surveyed  for  the  adjustment  of  the  RA¬ 
DIUS  images.  Tie  points  are  measured  be¬ 
tween  the  HYDICE,  KS-87,  and  RADIUS  im¬ 
ages,  along  with  straight  lines  and  right  angles. 

5.1.3  Block  adjustment  procedure 

The  block  adjustment  is  being  performed  us¬ 
ing  the  object-oriented  photogrammetry  pack¬ 
age  described  in  [McGlone,  1992;  McGlone, 
1995],  which  allows  the  utilization  of  images 
with  different  geometries,  along  with  the  rig¬ 
orous  incorporation  of  geometric  constraints. 

We  are  performing  the  block  adjustment  se¬ 
quentially,  beginning  with  individual  images, 
then  sub-blocks,  then  the  final  block  adjust¬ 
ment.  Each  individual  HYDICE  image  is  first 
adjusted  using  only  the  GPS  information  and 
holding  the  orientation  parameters  fixed.  The 
next  step  is  to  adjust  each  HYDICE  image  along 
with  its  overlapping  RADIUS  and  KS-87  im¬ 
ages,  using  the  measured  control  and  tie  points 
and  the  geometric  (straight  line  and  right  an¬ 
gle)  information.  Finally  we  will  perform  sub¬ 
block  adjustments  with  adjacent  HYDICE  im¬ 
ages  from  the  same  flight  line  and  from  adjacent 
flight  lines. 

This  procedure  allows  easier  editing  of  input 
data  for  mistakes,  since  adequate  redundancy  is 
furnished  by  the  frame  imagery,  while  minimiz¬ 
ing  the  solution  time  required  since  the  larger 
adjustments  are  started  with  good  orientation 
parameter  estimates.  This  is  an  important  con¬ 
sideration,  since  the  final  block  will  contain  48 
HYDICE  images,  approximately  30  frame  im¬ 
ages,  about  3000  constraints,  and  about  500 
points.  This  amounts  to  approximately  10000 
parameters.

5.1.4 Preliminary results


We  have  completed  the  GPS  adjustment  for  all 
of  the  HYDICE  images  and  have  completed  the 
solutions  with  point  and  geometric  information, 
along  with  overlapping  frame  imagery,  for  two 
HYDICE  images.  We  have  used  these  solutions 
to  understand  the  characteristics  of  the  image 
data  and  the  navigation  data,  and  also  to  de¬ 
velop  and  optimize  the  data  acquisition  and  so¬ 
lution  procedures. 


At this point, we have achieved registration
accuracies for the HYDICE images of 5-10 meters
(2-5 pixels). An orthoimage of image FHHYDICE4.3
is shown in Figure 13. This was produced
from a solution dividing the image into four sections,
using cubic polynomials for the Xc and
Yc positions and linear functions for the Zc and
κ parameters. The overall deformation of the
imagery  is  shown  by  the  curved  edges  of  the 
orthoimage,  while  the  removal  of  this  general 
trend  is  evidenced  by  the  fact  that  the  roads 
are  straight.  At  this  level  of  accuracy,  we  are 
up  against  imaging  distortions  in  the  sensor  and 
also  the  high-frequency  oscillations  (“wiggles”) 
present  in  the  imagery,  which  have  an  amplitude 
of  about  2-3  pixels. 


We  have  determined  that  the  high-frequency  os¬ 
cillations  cannot  be  modeled  using  orientation 
polynomials  of  reasonable  degree,  even  when  the 
strip  is  divided  into  very  small  sections.  Intro¬ 
ducing  splines  or  equivalent  models  with  suffi¬ 
cient  degrees  of  freedom  would  add  a  very  large 
number  of  unknowns  to  the  solution.  We  are 
therefore  investigating  procedures  to  remove  the 
high  frequency  deformations  before  solving  for 
the  overall  strip  geometry.  We  are  still  attempt¬ 
ing  to  utilize  the  angular  orientation  informa¬ 
tion  supplied,  which  would  be  the  most  direct 
method to remove the wiggles. In addition, we
are investigating image-based methods, using either
line-to-line correlation or edge continuity, to
determine the high-frequency deformations.
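
A minimal sketch of the line-to-line correlation idea follows. It estimates, for each image line, the integer cross-track shift that best aligns it with the previous line using normalized correlation over a search window. The function name `line_shifts`, the integer-only search, and the window size are assumptions for illustration; the method under investigation may differ, and real scene content that legitimately changes from line to line would bias such estimates.

```python
import numpy as np


def line_shifts(image, max_shift=5):
    """Integer cross-track shift of each line relative to the previous line.

    shifts[r] = s means image[r, j + s] best matches image[r - 1, j].
    """
    rows, cols = image.shape
    shifts = np.zeros(rows, dtype=int)
    for r in range(1, rows):
        # Reference window from the previous line, with the mean removed.
        ref = image[r - 1, max_shift:cols - max_shift]
        ref = ref - ref.mean()
        best, best_score = 0, -np.inf
        for s in range(-max_shift, max_shift + 1):
            cand = image[r, max_shift + s:cols - max_shift + s]
            cand = cand - cand.mean()
            denom = np.linalg.norm(ref) * np.linalg.norm(cand) + 1e-12
            score = float(ref @ cand) / denom
            if score > best_score:
                best, best_score = s, score
        shifts[r] = best
    return shifts
```

Accumulating the shifts with `np.cumsum` gives a displacement profile along the strip; its high-frequency part could then be removed before solving for the overall strip geometry.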


5.2 Analysis of HYDICE radiance imagery

Under our RCVW and APGD projects we have
performed several initial experiments with the
HYDICE calibrated radiance imagery, which
involved verification of radiometric calibration
and classification of HYDICE-generated
simulated Daedalus Airborne Thematic Mapper
(ATM) imagery. The following sections describe
these activities.

Figure 13: Orthoimage of FHHYDICE4.3.

5.2.1  Radiometric  calibration 

To  verify  the  radiometric  calibration  of  the  HY- 
DICE  imagery,  we  examined  the  correlation  be¬ 
tween  HYDICE  imagery  and  MTL  ground  truth 
data  by  plotting  spectral  radiance  curves  of  cal¬ 
ibration  panels  in  the  calibration  panel  imagery 
of  HYDICE  Flightlines  4  and  5.  For  each  gray 
level panel, we computed an average spectral
radiance curve from the HYDICE imagery and
compared it to a spectral radiance curve calculated
from the MTL ground truth data. MTL
provided ground truth spectral reflectance measurements
for the calibration panels, comprising
8 to 12 measurements per gray level panel.
The average spectral reflectance curve for each
gray level panel was computed and multiplied
by  the  spectral  downwelling  radiance  collected 
by  MTL  during  each  HYDICE  overflight. 
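
The comparison curve described above amounts to a per-wavelength product. The sketch below follows the text literally (mean panel reflectance multiplied by the measured downwelling radiance) and deliberately ignores atmospheric path effects and any geometric factor. The function names and array layout are assumptions made for the example.

```python
import numpy as np


def predicted_panel_radiance(reflectance_samples, downwelling):
    """Ground-truth radiance curve for one gray-level panel.

    reflectance_samples: (n_measurements, n_wavelengths) reflectance spectra.
    downwelling: (n_wavelengths,) downwelling spectrum recorded at overflight.
    """
    mean_reflectance = np.mean(reflectance_samples, axis=0)
    return mean_reflectance * downwelling


def relative_difference(measured, predicted):
    """Per-wavelength relative difference between image and ground-truth curves."""
    return (measured - predicted) / predicted
```

Plotting `relative_difference` against wavelength would make the visible-region discrepancies discussed in the next paragraph easy to see.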

This  procedure  was  applied  to  Flightlines  4  and 
5  panel  imagery  producing  comparisons  of  the 
average  spectral  radiance  per  gray  level  panel 
between  the  HYDICE  measurements  and  the 
computed  MTL  ground  truth  data.  The  spec¬ 
tral  radiance  comparisons  for  Flightlines  4  and 
5  appeared  quite  good.  The  most  significant 
differences  are  in  the  visible  region  and  can  be 
attributed  to  the  atmospheric  effects  and  up- 
welling  radiance  not  present  in  the  MTL  ground 
truth  data.  This  comparison  work  also  lays 
the  groundwork  for  conversion  of  HYDICE  ra¬ 
diance  imagery  to  apparent  reflectance,  allow¬ 
ing  direct  comparison  and  matching  of  HYDICE 
spectral  apparent  reflectance  curves  with  MTL 
measured  spectral  reflectance  curves. 


5.2.2  Classification  experiments 

Due  to  the  volume  of  image  data  generated  by 
the  HYDICE  hyperspectral  sensor,  data  reduc¬ 
tion  methods  are  being  pursued.  A  first  at¬ 
tempt  at  data  reduction  involves  the  averaging 
of  HYDICE  bands  to  simulate  Daedalus  Air¬ 
borne  Thematic  Mapper  (ATM)  imagery.  Fig¬ 
ure  14a  shows  a  near  infrared  band  image  from 
a  portion  of  HYDICE  Flightline  4  (FHDAED4.3) 
using  band  7  from  the  simulated  Daedalus  ATM 
imagery. 
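
Band averaging of this kind can be sketched as follows. The wavelength ranges passed in `band_ranges` are placeholders supplied by the caller; the actual Daedalus ATM band limits are not given in this paper and are not assumed here. The function name and the unweighted mean (rather than a sensor response weighting) are also assumptions.

```python
import numpy as np


def simulate_broad_bands(cube, wavelengths, band_ranges):
    """Average narrow hyperspectral bands into broader simulated bands.

    cube: (rows, cols, n_bands) radiance cube.
    wavelengths: (n_bands,) band-center wavelengths.
    band_ranges: list of (low, high) limits, one pair per simulated band.
    """
    wavelengths = np.asarray(wavelengths, float)
    out = np.empty(cube.shape[:2] + (len(band_ranges),))
    for k, (low, high) in enumerate(band_ranges):
        mask = (wavelengths >= low) & (wavelengths < high)
        if not mask.any():
            raise ValueError(f"no source bands fall in range {low}-{high}")
        out[..., k] = cube[..., mask].mean(axis=-1)
    return out
```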


Table 3: Surface material classes.

asphalt            gravel
clay               new asphalt roofing
concrete           old asphalt roofing
coniferous tree    shadow
deciduous tree     sheet metal roofing
deep water         soil
grass              turbid water

Using  existing  multispectral  software,  surface 
material  classmaps  are  generated  with  a  max¬ 
imum  likelihood  classifier  for  the  surface  mate¬ 
rials  listed  in  Table  3.  The  selected  training 
sets  to  compute  the  above  spectral  class  statis¬ 
tics  originate  from  an  earlier  section  in  Flight¬ 
line  4  (i.e.,  FHDAED4.2).  This  flightline  seg¬ 
ment  contains  natural  vegetation,  bare  soil  and 
waterbody  features  along  with  barrack/motor 
pool areas very similar in scene content to
FHDAED4.3. A problem that arose involved
singular covariance matrices for the waterbodies.
Upon closer inspection of the waterbody
training set statistics, the SWIR bands (i.e.,
Daedalus ATM bands 9 and 10) had zero mean
and standard deviation entries. This condition
for waterbodies can be explained upon closer
inspection of the HYDICE imagery.

The  HYDICE  calibrated  radiance  imagery  from 
which  the  simulated  Daedalus  imagery  was  gen¬ 
erated  contains  zero  radiance  values  over  the 
waterbodies.  These  zero  radiance  values  are  a 
result  of  clipping  negative  radiance  values  en¬ 
countered  during  the  conversion  of  a  HYDICE 
Band-Interleaved-by-Line (BIL) image cube to CMU
image format. Negative radiance values are a
by-product of the HYDICE post-flight data processing
when performing the radiometric calibration
of the HYDICE imagery [Aldrich et al.,
1996].
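
The failure mode is easy to reproduce. A Gaussian maximum likelihood classifier needs the inverse and determinant of each class covariance matrix, and a band that is identically zero over the training set contributes a zero row and column, so the matrix is singular. The sketch below is a generic illustration with made-up numbers; it is not the multispectral software used here, and the function names are assumptions.

```python
import numpy as np


def class_statistics(samples):
    """Mean vector and covariance matrix of training samples (n_samples, n_bands)."""
    return samples.mean(axis=0), np.cov(samples, rowvar=False)


def log_likelihood(pixel, mean, cov):
    """Gaussian log-likelihood up to an additive constant.

    Raises LinAlgError when the class covariance is singular.
    """
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise np.linalg.LinAlgError("singular class covariance matrix")
    diff = pixel - mean
    return -0.5 * (logdet + diff @ np.linalg.solve(cov, diff))
```

Typical workarounds, none of which are claimed by the text, include dropping the degenerate bands for the affected class or adding a small value to the covariance diagonal.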

Upon  visual  inspection  of  the  resulting  classi¬ 
fication  shown  in  Figure  14b,  vegetated  (e.g., 
grass  and  deciduous  tree)  and  asphalt  features 
were  labelled  quite  well.  With  the  inclusion  of 
roofing  materials,  building  rooftops  of  barracks 
and  motor  pool  facilities  are  visible  from  the 
surrounding  parking  lot  background.  Confu¬ 
sion  between  soil,  gravel,  and  concrete  features 
is  still  an  issue. 

Figure 15a shows a close-up of test area RADT9,
which is located left of center in Figure 14a. The
resulting surface material classmap is shown in
Figures 15b, 15c and 15d. The vegetated and
parking lot areas along with asphalt road features
are correctly classified, while confusion between
soil, gravel, and concrete features is still
evident. An interesting effect of roof structure
can be observed in the group of peaked roof
buildings located in the center portion of Figure
14a. The peaked roof structure affects the
illumination intensity and type (i.e., full sun versus
full shade) incident on the roofing material. The
change in spectral illumination affects the spectral
content of these building roof structures, as
shown by their assignment to sheet metal roofing,
old asphalt roofing or gravel classes.

Figure 14: Surface material classification from
simulated Daedalus ATM image FHDAED4.3.
(a) Simulated Daedalus ATM near infrared band
image (Band 7). (b) Surface material classification.
Legend (partial): coniferous tree, deciduous tree,
deep water, grass, turbid water, unclassified.

Figure 15: Test area RADT9 surface material
classification from simulated Daedalus ATM image
FHDAED4.3. (a) Near infrared band image.
(b), (c), (d) Surface material.

6  Progress  in  Synthetic  Environments 

Constructing  large-scale  virtual  world 
databases  for  ground-based  simulation  re¬ 
quires  the  integration  of  information  from 
various  sources,  including  digital  map  data, 
aerial  and  satellite  imagery,  detailed  line  draw¬ 
ings,  and  ground-based  photography.  Such 
virtual  world  databases  have  significant  appli¬ 
cations  in  DoD  training,  mission  planning  and 
rehearsal,  and  autonomous  agent  simulation. 

Under the RCVW program we have focused on
three aspects of the database construction process.
In  Section  6.1  we  discuss  research  in  the  pre¬ 
processing  of  existing  spatial  data  to  provide  a 
coherent  set  of  features  to  be  modelled  within 
the  visual  simulation.  In  Section  6.2  we  describe 
progress  in  simulation  database  construction  in¬ 
cluding  Camp  Pendleton  and  a  revisit  to  the 
STOW-E  terrain  skin. 

6.1  Efficient  Simulation  Database 
Representation 

When  planning  the  content  of  a  simulation 
database,  there  are  many  constraints  that  can 
restrict  the  amount  of  spatial  data  that  can 
be  represented  in  the  final  compiled  visual 
database.  One  of  the  first  steps  in  database 
construction  is  the  collection  and  preparation 
of  the  source  geospatial  data.  The  goal  is  to 
intelligently reduce the geospatial source
data  while  still  retaining  sufficient  complexity 
to  create  a  reasonably  accurate  and  dense  vi¬ 
sual  simulation  database.  The  two  main  meth¬ 
ods  are  selection,  which  will  remove  certain  less 
important  features,  and  generalization,  which 
will  reduce  the  number  of  points  used  to  de¬ 
scribe  features  while  retaining  the  same  basic 
shape. In this discussion we will focus mainly
on road networks, but similar issues exist for
vegetation coverage, drainage networks, coastlines,
and built-up areas.

6.1.1  Selection 

Selection  of  road  networks  can  be  accomplished 
in  several  ways.  Currently  we  perform  simple 
selection based on feature attribution. This can
involve arbitrarily complex criteria for examining
the attributes; however, it is limited
by those attributes available in the geospatial
source data. In many cases, only attributes
based upon intrinsic properties such as road
width and basic road type may be guaranteed
to  be  available.  Figures  16  and  17  show  a  sim¬ 
ple  example  of  selection  of  the  major  roads  in 
the  Camp  Pendleton  area  using  the  road  width 
attribute.  We  are  working  on  more  advanced 
solutions  that  interact  with  the  database  com¬ 
pilation  process  to  use  polygon  density  and  net¬ 
work  connectivity  as  selection  criteria.  Roads 
are  weighted  based  on  their  importance,  derived 
either  from  attributes  or  in  terms  of  connectiv¬ 
ity,  and  the  less  important  roads  are  removed 
until  the  polygon  density  falls  below  a  given 
set  of  budgets  assigned  for  the  visual  simulation 
database. 
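
The budget-driven selection described above can be sketched as a simple greedy loop. The record fields (`id`, `importance`, `polygons`) and the single global budget are assumptions for illustration; the text describes a set of budgets and an importance measure derived from attributes or connectivity, neither of which is modeled here.

```python
def select_roads(roads, polygon_budget):
    """Drop the least important roads until the polygon total fits the budget.

    roads: list of dicts with hypothetical keys 'id', 'importance', 'polygons'.
    Returns the retained roads, most important first.
    """
    kept = sorted(roads, key=lambda r: r["importance"], reverse=True)
    total = sum(r["polygons"] for r in kept)
    while kept and total > polygon_budget:
        # The list is sorted, so the last entry is the least important road.
        total -= kept.pop()["polygons"]
    return kept
```

A connectivity-aware version would need to recompute importance after each removal, since deleting one road can change how much the remaining roads contribute to the network.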

6.1.2  Generalization 

Once  a  road  network  has  been  selected,  feature 
generalization  is  used  to  reduce  the  number  of 
points needed to describe a road while retaining
the same basic shape. This is particularly
important since a road may be made up of an
arbitrary number of intermediate points, few
of which are dictated by topology; some are
artifacts of the feature compilation process. Our
current implementation is based on the
Douglas-Peucker distance-based generalization
algorithm. Each edge in the road can be given a
different tolerance, or a default tolerance can be
specified for the entire database. Common
generalization algorithms operate on a
feature-by-feature basis, and this can create
significant visualization problems when the
associated polygonal representation is generated
for the simulation database. For example, since
portions of the
feature  edges  can  be  moved,  they  can  poten¬ 
tially  intersect  with  other  edges  from  other  fea¬ 
tures.  In  dense  high  resolution  road  networks 
problems  with  proximity  and  point  of  closest 
approach  are  common. 
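
For reference, a compact version of the basic Douglas-Peucker algorithm is sketched below, without any of the extensions discussed in this section. It uses the distance to the chord segment rather than to the infinite line, which is a common variant; the function names are chosen for the example.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (all 2-D tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Simplify a polyline, keeping points farther than `tolerance` from the chord."""
    if len(points) < 3:
        return list(points)
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [points[0], points[-1]]
    # Split at the farthest point and simplify each half independently.
    left = douglas_peucker(points[:index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right
```

Because each edge is simplified in isolation, nothing in this basic form prevents the simplified edge from crossing or crowding a neighboring feature.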

Figure 16: Original DTOP for Camp Pendleton.

Figure 17: [...]ters).


We have extended the basic generalization
algorithm to prevent intersections from occurring.
Additionally,  a  proximity  distance  can  be  spec¬ 
ified,  either  on  a  per  edge  basis,  or  as  a  default 
value  for  all  edges  associated  with  a  road  fea¬ 
ture.  This  prevents  features  from  being  gener¬ 
alized  to  be  too  close  to  another  feature.  Com¬ 
mon  examples  occur  when  generalizing  across 
traditional cartographic layers, such as
transportation and drainage. Figure 18 illustrates
the case where there is a road running along
a river and generalization without proximity
checking would create an unacceptable overlap
in the polygonal representation of the road and
riverbank. Another example where proximity
checking  is  important  is  the  case  of  a  divided 
highway.  The  divided  highway  is  especially  in¬ 
teresting  since  in  the  cartographic  representa¬ 
tion,  each  collection  of  lanes  is  represented  with¬ 
out  any  reference  to  the  other.  So,  rather  than 
handling  this  as  a  special  case  using  feature  type 
or  attribution,  it  must  be  handled  more  gener¬ 
ally  at  the  geometric  level. 
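
A purely geometric acceptance test of the kind described above might look like the following sketch. A candidate simplified edge is rejected if any of its segments crosses a segment of another feature or comes closer than the proximity distance. The function names and the brute-force pairwise check are assumptions for illustration; the actual extension to the generalization algorithm is not specified at this level of detail.

```python
import math


def _orient(a, b, c):
    """Signed area test: positive if c is to the left of the directed line a-b."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _point_segment_distance(p, a, b):
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx == 0 and dy == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def segment_distance(p1, p2, q1, q2):
    """Minimum distance between 2-D segments p1-p2 and q1-q2 (0 if they cross)."""
    if _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0 and \
            _orient(p1, p2, q1) * _orient(p1, p2, q2) < 0:
        return 0.0
    return min(_point_segment_distance(p1, q1, q2),
               _point_segment_distance(p2, q1, q2),
               _point_segment_distance(q1, p1, p2),
               _point_segment_distance(q2, p1, p2))


def edge_allowed(candidate, other_segments, proximity):
    """True if no candidate segment touches or comes within `proximity` of the others."""
    for a, b in zip(candidate, candidate[1:]):
        for c, d in other_segments:
            dist = segment_distance(a, b, c, d)
            if dist == 0.0 or dist < proximity:
                return False
    return True
```

Since the test uses only geometry, the two carriageways of a divided highway are handled the same way as a road and a river, with no reference to feature type.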

The “best” generalization distance tolerance to
use  depends  partly  on  the  constraints  of  the  vi¬ 
sual  simulation  database  (polygon  limits,  den¬ 
sity,  etc.),  and  partly  on  the  cartographic  com¬ 
pilation  process  that  produced  the  feature  data. 
Some  source  feature  data  has  nearly  straight 
edges  with  several  extra  points  that  can  be  re¬ 
moved  without  noticeably  changing  the  edge, 
while  other  data  must  be  changed  significantly 
in  order  to  remove  the  same  percentage  of 
points.  Figure  19  shows  three  NIMA  datasets 
containing  road  networks  compiled  at  about  the 
same cartographic scale. The number of roads
and their extent is not relevant. What is interesting
is that the point at which 50% of the points
can be removed requires generalization tolerances
ranging from 4 meters to 13 meters. These
observations lead to interesting research questions
in data-directed generalization rather than fixed
thresholds based upon map scale or product
type.

6.1.3  Matching/extraction 

In  some  cases,  it  is  desirable  to  use  a  selected 
and  generalized  dataset  (Figure  20)  as  a  tem¬ 
plate  for  selecting  data  from  another  dataset 
(Figure  16).  For  example,  an  old  database 
might  be  rebuilt  with  new  data  or  with  a  bet¬ 
ter  generalization  algorithm.  Or  we  may  sim¬ 
ply  want  to  compare  cartographic  data  of  the 
same  area  from  two  different  sources.  The  basic 
procedure  is  to  first  create  a  corridor  of  a  given 
width  around  each  edge  in  the  template  dataset. 
Then,  any  edge  in  the  test  dataset  that  has  a 
certain  percentage  of  its  length  inside  a  corri¬ 
dor  is  kept.  This  produces  useful  initial  results, 
and  can  be  improved  by  changing  the  width  and 
percent tolerance parameters. It does, however,
have some side effects. If the corridor width is
too high or the tolerance too low, there will be a
large number of short spur edges in the output.
Likewise, if the width is too low, there will be
gaps in some of the output edges. In general, it
seems to be preferable to have more spurs than
gaps, as they can be easily removed with some
hand editing and simple automatic tests. Figure
21 shows the result of matching and extraction
using the road network in Figure 20 as a
template for extraction from the newer dataset
in Figure 16.

Figure 18: Proximity check for generalization.
(a) Original data. (b) Generalized without
proximity check. (c) Generalized with proximity
check.

Figure 19: Tolerance needed to reduce data by
half. Axes: Generalization Tolerance (m) versus
Number of Road Segments.

Figure 20: Selected, generalized from DTOP
and TIGER.

Figure 21: Extracted data.

Once  extraction  has  been  performed  we  can 
compare  the  resulting  datasets  for  significant 
change.  This  is  done  by  searching  at  points 
along  an  edge  using  one  dataset  as  a  reference. 
The  result  is  a  list  of  potential  edge  matches  and 
the  points  of  the  edges  can  be  further  compared 
to  find  a  common  subset  of  points.  Once  the 
association is made, the maximum distance between
the edges and the total area between the
edges can be computed. Of course, this comparison
is limited by the spurs and gaps created
by the extraction process, which can significantly
reduce the number of edges successfully
matched.

6.2  Progress  in  Database 
Construction 

Our  early  work  in  the  area  of  automated 
simulation  database  construction  involved  the 
construction  of  Triangular  Irregular  Networks 
(TINs)  that  formed  a  simple  but  efficient  bare 
earth  terrain  skin  [Polis  and  McKeown,  1993]. 
This was closely followed by the development
of integrated TINs (ITINs) that produce a
highly efficient terrain skin which can incorporate
roads, bridges, lakes, and drainage features,
including islands, directly into the terrain skin
with minimal manual modeling. Initial efforts
permitted roads to be automatically modeled to
obey physical constraints with respect to road
grade and side slope [Polis et al., 1995]. ITINs
have  been  shown  to  represent  the  character  of 
the  terrain  significantly  better  than  the  right 
triangle  representation  used  previously  at  the 
same  polygon  cost. 

Recent developments have allowed for the
experimental incorporation of tree canopies,
arbitrary polygonal areas with reduced polygon
budgets, and multi-level TINs representing sea
state and bathymetric surfaces. Drainage networks
can be used in conjunction with a DEM
to enforce streams that flow downhill, and
transportation networks can be used to generate
cut-and-fill roads of various widths and road
types integrated into the terrain skin.

While  many  experimental  databases  have  been 
generated,  it  is  notable  that  the  MAPSLab 
ITIN  process  has  been  used  in  the  development 
of  several  key  STOW  environmental  databases, 
namely  STOW-E  (Central  Germany),  Range 
400,  and  Camp  Pendleton.  In  addition  the 
Prairie  Warrior  exercise  database  (Chorwan), 
Fort  Benning  MOBA  test  database,  and  the  Op¬ 
eration  Kirby  database  were  constructed  using 
these  tools. 

6.2.1  Camp  Pendleton  database 

During early 1996, a 40 km x 40 km integrated
terrain skin of Camp Pendleton was generated
to support and visualize Semi-Automated
Forces (SAF) exercises. The Camp Pendleton
exercise database was constructed using a variety
of automated, semi-automated, and manual
methods. We will briefly describe the database
and our role in constructing it. A more complete
description is available in [Polis et al., 1997].

Terrain  was  modeled  with  ITINs  that  included 
bathymetry  as  well  as  terrain  elevation  data 
from  six  sources.  Built-up  areas  and  areas 
of  high  fidelity  were  accommodated  as  relaxed 
polygon budget constraints for the ITIN. Certain
ground-based features were integrated into
the exercise terrain skin. They included drivable
roads, simple bridges, lakes and rivers,
underwater breaklines, an ocean surface, and
detailed cliff structure along the Pacific coast.

The  Camp  Pendleton  terrain  database  is  sup¬ 
porting  development  of  synthetic  forces  con¬ 
ducting  amphibious  operations  within  a  Joint 
Task  Force.  Characterization  of  the  littoral  en¬ 
vironment  presents  special  challenges  in  devel¬ 
oping  advanced  synthetic  environments  at  the 
“seams”  where  air,  land  and  sea  intersect  at  the 
coastline  and  in  the  surf  zone.  The  bulk  of  our 
work  for  this  database  was  done  over  a  three 
month  period  that  ended  with  a  delivery  of  the 
final  integrated  terrain  skin  to  USATEC  in  the 
beginning  of  March,  1996.  This  included  a  visit 
to  Camp  Pendleton  and  a  detailed  tour  of  the 
significant  beach  structures  in  December,  1995. 

A  key  requirement  for  Camp  Pendleton  was  to 
model  the  cliff  structure  around  several  of  the 
important  beaches.  However,  structures  like 
cliffs  are  not  well  reflected  in  DTED  Level  II 
due  to  a  combination  of  sampling  issues  and  in¬ 
terpolation  in  the  construction.  Since  the  cliffs 
represent  an  impediment  to  movement  along  the 
beach,  we  were  forced  to  resort  to  hand  model¬ 
ing  of  the  cliffs  in  the  form  of  pairs  of  breaklines 
for  the  top  and  bottom  of  the  cliffs.  Figure  22 
shows  a  photograph  used  as  a  reference  during 
modeling  and  the  corresponding  view  of  the  fin¬ 
ished  database.  The  resulting  cliffs  present  an 
appropriate  barrier  to  mobility. 

6.2.2  Rebuilt  STOW-E  database 

In 1994 we built the terrain skin for the
database used in the STOW-E (Synthetic Theater
of War-Europe) demonstration. The
database covered a 64 x 84 km area in Central
Germany, and was created with an early
version of the ITIN software. In that version,
roads were the only feature type that could be
integrated. As a result, hand modeling was
necessary for water bodies, bridges, and a variety
of specialized features. In addition, feature
integration wasn't always done smoothly, resulting
in “killer cuts” (see Figure 23a). A vehicle
traveling cross country and approaching the road
from the left has only a few seconds warning of
the steep drop. In some cases, the terrain height
and texture are the same on both sides of the
cut, making it even harder to recognize the sudden
drop ahead. It is unrealistic for there to be
no warning, and in fact the source data is not
detailed enough to even know whether there really
was a road cut. Thus, the cut was more
of an artifact of the integration process than a
carefully constructed model based on knowledge
of the real world.

Figure 22: Cliff structure at White Beach, Camp
Pendleton.

Figure 23: Original and rebuilt STOW-E databases.
(c) Regensburg in original database.
(d) Regensburg in rebuilt database.

The  ITIN  system  has  been  improved  greatly 
since  the  original  STOW-E  database  was  built. 
As shown in the Camp Pendleton database, water
bodies can now be integrated, and bridges
are constructed automatically where roads cross
water. Linear drainage features can be widened
and integrated into the terrain using a process
similar to that used for roads. All of these features
are integrated more smoothly into the terrain,
virtually eliminating killer cuts. We can now
construct fully populated databases containing
object models, not just a basic terrain skin.
Tree  canopies  are  generated  and  we  can  import 
models  of  individual  trees,  buildings,  and  other 
features  and  place  them  on  the  terrain.  These 
improvements  demonstrate  the  extent  to  which 
it  is  possible  to  rebuild  the  STOW-E  database 
without  resorting  to  hand  modeling. 

Figure  23  shows  two  views  each  of  the  origi¬ 
nal  and  rebuilt  databases.  Figure  23a  shows 
a  killer  cut  in  the  original  STOW-E  database. 
This  was  automatically  corrected  by  running 
the  road  over  the  hill  instead  of  through  it. 
The  result  is  shown  in  Figure  23b.  Figures  23c 
and  23d  compare  views  of  Regensburg,  a  large 
city  on  the  Danube.  This  area  was  extensively 
hand  modeled,  so  it  is  a  difficult  challenge  for 
the  software.  The  bridges  and  complex  river 
were  successfully  recreated  and,  in  fact,  the  au¬ 
tomatically  generated  bridge  is  neater  than  the 
hand  modeled  one.  Models  recovered  from  the 
original  database  were  placed  on  top  of  the  new 
terrain  skin.  In  the  upper  left,  one  of  the  tree 
canopies  generated  by  our  software  is  visible. 

Several  discrepancies  remain.  The  most  obvious 
is  the  missing  railroad  bridge  in  the  foreground. 


Since  we  do  not  yet  integrate  railroads,  we  did 
not  construct  railroad  bridges.  Another  obvi¬ 
ous  difference  is  the  area  just  left  of  image  cen¬ 
ter  which  has  a  standard  ground  texture  in  the 
original  database,  but  has  a  built-up  area  tex¬ 
ture  in  the  rebuilt  database.  This  is  because  the 
built-up  area  texture  was  applied  automatically 
to  all  polygons  identified  as  built-up  areas  in  the 
cartographic  source  data.  It  is  not  known  what 
additional  information  was  originally  used  to 
manually  determine  that  the  standard  ground 
texture  was  appropriate.  Perhaps  the  area  was 
known  to  be  a  park  from  photographs  or  a  visit 
to  the  site. 
Abstract

This  report  summarizes  progress  in  image 
understanding  research  at  the  University  of 
Massachusetts  over  the  past  year.  Many  of  the 
individual  efforts  discussed  in  this  paper  are 
further  developed  in  other  papers  in  this 
proceedings.  The  summary  is  organized  into 
several  areas: 

1.  3D  Site  Modeling  from  Aerial  Views 

2.  Terrest  Terrain  Reconstruction  System 

3. Terrain Classification / Force Monitoring

4.  Content-based  Image  Indexing 

5.  Learning  in  Vision 

6.  Miscellaneous  Related  Research 

The  research  program  at  UMass  has  as  one  of  its 
goals  the  integration  of  a  diverse  set  of  research 
efforts  into  systems  that  are  ultimately  intended  to 
achieve  robust,  real-time  image  interpretation  in  a 
variety  of  vision  applications. 

1. 3D  Site  Modeling  from  Aerial  Views 

1.1 The Ascender Site Modeling System

Under  the  DARPA/ORD  RADIUS  program, 
UMass  developed  the  ASCENDER  system 
(Automated  Site  Construction,  Extension, 
Detection  and  Refinement)  for  automatically 
populating  a  site  model  with  3D  building  models 
extracted  from  multiple,  overlapping  images  (both 
nadir  and  oblique)  of  the  site  (Collins  et  al.  1996). 
The  UMass  design  philosophy  emphasizes  model- 
directed  processing,  rigorous  3D  photogrammetric 
camera  models,  and  fusion  of  information  across 
multiple  images  for  increased  accuracy  and 
reliability.  The  Ascender  system  has  been 
transferred to Lockheed-Martin and the National
Exploitation  Laboratory  (NEL)  where  it  has  been 
evaluated  on  classified  data  sets. 

The  Ascender  system  acquires,  extends  and  refines 
3D  geometric  site  models  from  aerial  imagery  with 
known  parameters.  To  acquire  a  new  site  model, 
an  automated  building  detector  is  run  on  one 
image  to  hypothesize  potential  building  rooftops. 
Supporting  evidence  is  located  in  other  images  via 
epipolar  line  segment  matching  in  constrained 
search  regions.  The  precise  3D  shape  and  location 
of each building is then determined by multi-image
triangulation and shape optimization under
constraints of 3D orthogonality, parallelism,
collinearity and coplanarity of lines and surfaces.
As  new  images  of  the  site  become  available,  they 
are  matched  to  the  partial  site  model  and  model 
extension  and  refinement  procedures  are 
performed  to  add  previously  unseen  buildings  and 
to  improve  the  geometric  accuracy  of  the  existing 
building  models.  In  this  way,  the  system  gradually 
accumulates  evidence  over  time  to  make  the  site 
model  more  complete  and  more  accurate. 

Based  on  initial  experience  in  the  evaluation  at  the 
NEL,  major  changes  have  been  made  to 
Ascender's  control  system.  The  original  system 
used  a  single  reference  image  to  generate  roof 
hypotheses  in  the  form  of  polygons,  and  then  used 
the  remaining  images  to  verify/reject  buildings  by 
constructing  a  3D  model.  If  a  building  hypothesis 
was  not  found  in  the  reference  image,  the  building 
would  not  be  constructed  even  though  it  might  be 
clearly  visible  in  one  or  more  of  the  other  images. 
A  new  control  strategy  has  been  implemented 
under  which  all  images  are  processed  uniformly; 
polygons  found  in  any  image  are  used  as  the  set  of 
initial  rooftop  hypotheses  from  which  the  3D 
reconstruction  begins. 

Tests  have  been  performed  on  a  subregion  of  the 
Fort  Hood  dataset.  Polygons  were  detected  in 
seven  images  and  redundant  polygons  eliminated 
on  the  basis  of  overlap.  Each  of  the  remaining 
polygons was then used to construct a 3D building
model. Models that had a side or height of less
than 5 meters were eliminated. Using this scheme,
92% of the 76 rooftop polygons were detected,
leaving six polygons missed in all seven images.
An  additional  45  polygons  represented  false 
positives  from  either  errors  in  the  2D  grouping 
process  that  survived  verification  or  the 
reconstruction  of  a  cultural  feature  other  than  a 
building  (parking  areas,  playing  fields,  etc.)  that 
had  errors  in  height  due  to  limited  support  from 
the  image  set. 
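
The two filtering steps described above (eliminating redundant polygons on the basis of overlap, then discarding models with a side or height under 5 meters) can be illustrated with a minimal sketch. It is not the Ascender implementation: it assumes, purely for illustration, that each rooftop hypothesis has been reduced to an axis-aligned box `(x0, y0, x1, y1)` in a common ground frame, and the names `iou`, `drop_redundant`, `keep_model` and the 0.5 overlap threshold are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def drop_redundant(boxes, threshold=0.5):
    """Keep a box only if it does not overlap an already kept box too much.

    The threshold is an assumed value, not one taken from the text.
    """
    kept = []
    for box in boxes:
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept


def keep_model(box, height, min_size=5.0):
    """Reject models whose shortest side or height is under min_size meters."""
    width = box[2] - box[0]
    depth = box[3] - box[1]
    return min(width, depth, height) >= min_size


# Hypothetical hypotheses pooled from several images.
hypotheses = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
unique = drop_redundant(hypotheses)
```

In this toy input the second box overlaps the first heavily and is dropped, so two hypotheses remain for 3D reconstruction.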

1.2  Ascender  II:  Context  Sensitive  Control 
of  Reconstruction 

Work  on  the  Ascender  I  system  demonstrated  that 
the  use  of  multiple  strategies  and  3D  information 
fusion  can  significantly  extend  the  range  of 
complex  building  types  that  can  be  reconstructed 
(Jaynes  et  al.  1996).  Under  the  DARPA  APGD 
program,  we  are  designing  and  building  the 
successor  to  the  Ascender  system.  The  design 
approach  is  based  on  the  observation  that  while 
many IU techniques function reasonably under
constrained conditions, no single IU method works
well  under  all  conditions.  Consequently,  work  on 
Ascender  II  is  focusing  on  the  use  of  multiple 
alternative  reconstruction  strategies  from  which 
the  most  appropriate  strategies  are  selected  by  the 
system  based  on  the  current  context.  In  particular, 
the  new  system  will  utilize  a  wider  set  of 
algorithms  that  fuse  2D  and  3D  information  and 
make  use  of  EO,  SAR,  IFSAR,  and  multispectral 
imagery  during  the  reconstruction  process.  We 
believe  that  such  a  system  will  be  capable  of  more 
robust  reconstruction  of  three  dimensional  site 
models  than  has  been  demonstrated  in  the  past  and 
will  significantly  reduce  the  effort  required  by 
image  analysts  during  the  reconstruction  process. 
This  in  turn  will  result  in  faster  development  of 
topical  situational  and  visualization  products  of 
military  significance. 

Ascender II is organized into two subsystems. The
IU components of the system responsible for
manipulating  image  data  are  being  constructed  in 
the  ARPA  RCDE  system,  while  the  control  and 
inferencing  components  are  represented  as 
Bayesian  belief  networks  constructed  using  the 
Hugin  system  (Jensen  1996).  Communication 
between  the  two  subsystems  is  currently  being 
supported  by  UNIX  socket  facilities  using  packets 
structured  for  this  application.  Control  policies 
(strategies)  are  associated  with  the  object  classes 
represented  in  the  network.  Execution  of  a  control 
policy  results  in  the  accumulation  of  evidence 
for/against  the  corresponding  network  node.  This 
evidence  is  propagated  through  the  network  based 
on  the  Bayesian  probability  tables  constructed  as 
part  of  the  knowledge  base.  Currently,  a  maximum 
uncertainty  policy  is  used  to  select  the  next  node  in 
the network to be expanded, although more
sophisticated mechanisms are being explored as
part  of  the  research.  A  second  issue  being 
examined  relates  to  the  appropriate  granularity  of 
the  control  policies  and  therefore  the  granularity  of 
the  reconstruction  systems  themselves.  Finally,  a 
major  effort  is  underway  to  develop  new  and 
expanded reconstruction strategies and the IU
procedures  required  to  support  them. 
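
One way to read the maximum uncertainty policy mentioned above is to expand the node whose current belief has the highest Shannon entropy. The sketch below makes that assumption explicit; it does not use the Hugin system, and the node names, belief values and function names are hypothetical.

```python
import math


def entropy(belief):
    """Shannon entropy (in bits) of a discrete belief {state: probability}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def most_uncertain(beliefs):
    """Return the node whose belief distribution has maximum entropy."""
    return max(beliefs, key=lambda node: entropy(beliefs[node]))


# Hypothetical posterior beliefs for two object-class nodes.
beliefs = {
    "building": {"present": 0.9, "absent": 0.1},
    "parking_lot": {"present": 0.5, "absent": 0.5},
}
next_node = most_uncertain(beliefs)
```

Here `parking_lot` would be selected, since an even split carries the most uncertainty; running its control policy would then add evidence for or against that node.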

1.3  Reconstruction  Strategies 

Our  focus  under  the  DARPA  APGD  program  will 
be  on  more  general  3D  reconstruction  strategies 
that  utilize  multiple  types  of  features  (points,  lines, 
surfaces),  and  that  can  be  applied  to  a  wide  range 
of  parameterized  building  classes.  The  system- 
level  approach  involves  multiple  alternative 
detection  and  reconstruction  strategies,  invoked  by 
clear  contextual  cues,  that  combine  a  wider  set  of 
algorithms  and  features  for  generating  and  fusing 
2D  and  3D  information.  The  strategies  being 
developed will utilize both monocular optical
data and digital elevation data obtained from
IFSAR  or  multi-view  stereo  reconstruction  from 
EO  data. 

Several  new  reconstruction  strategies  are  being 
developed for this system. To take a single
example, Ascender I rooftop hypotheses from a
single  optical  image  are  projected  into  registered 
elevation  data  and  used  to  trigger  the  application 
of  a  rooftop  model  matching  strategy  to  the 
restricted  subset  of  the  data.  The  model  matcher 
uses  a  knowledge  base  of  approximately  12 
parameterized  rooftop  models  (including  flat, 
peaked,  and  curved  roof  models)  and  matches  by 
correlating  a  histogram  of  surface  orientations 
derived  from  the  data  with  the  orientation  pattern 
of  the  model  surfaces  on  the  Gaussian  sphere. 
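
The indexing step can be pictured as ranking candidate roof models by how well their orientation histograms correlate with the histogram measured from the data. The sketch below is an illustration under simplifying assumptions: histograms are flat lists over a fixed set of orientation bins rather than cells on the Gaussian sphere, the score is a plain normalized (Pearson) correlation, and the three model histograms are made-up values.

```python
import math


def normalized_correlation(h1, h2):
    """Pearson correlation between two equal-length histograms."""
    m1 = sum(h1) / len(h1)
    m2 = sum(h2) / len(h2)
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = math.sqrt(sum((a - m1) ** 2 for a in h1) * sum((b - m2) ** 2 for b in h2))
    return num / den if den else 0.0


def rank_models(data_hist, model_hists, top=2):
    """Return the names of the `top` best-correlating models, best first."""
    scores = {name: normalized_correlation(data_hist, h)
              for name, h in model_hists.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Hypothetical 4-bin orientation histograms (bins: up, tilt-left, tilt-right, other).
models = {
    "flat": [1.0, 0.0, 0.0, 0.0],
    "peaked": [0.0, 0.5, 0.5, 0.0],
    "curved": [0.25, 0.25, 0.25, 0.25],
}
observed = [0.1, 0.45, 0.4, 0.05]
candidates = rank_models(observed, models)
```

The highest-ranked candidates would then be passed on to surface fitting.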

Initial results have been obtained with the
Ascona/ISPRS "Flat Scene" [ftp://ftp.ifp.uni-stuttgart.de/pub/wg3/].
This scene contains several peaked roofs with
different slopes, cluttered with gabled windows,
chimneys, etc. In addition,
the  elevation  data  contains  noise  unavoidably 
introduced  through  the  stereo  reconstruction 
process.  In  this  experiment,  the  top  two  models 
resulting  from  the  correlation  process  were 
selected  and  fit  to  the  constrained  elevation  data. 
The  model  with  the  lowest  residual  fit  error  was 
chosen  for  the  final  reconstruction.  Excluding  the 
six  rooftops  missed  by  the  hypothesis  generation 
phase,  the  remaining  eleven  rooftops  were 
correctly  classified  as  peaked  roof  buildings.  After 
surface  fitting  the  models  using  the  initial 
parameters  found  during  indexing,  the  average 
residual  fit  error  was  0.192  meters.  See  Jaynes  et 
al.  (1997)  in  these  proceedings  for  more  detail. 

1.4  Reconstruction  Strategies  from  IFSAR 

As  part  of  an  ORD-sponsored  feasibility  study, 
several  approaches  to  bottom-up  reconstruction  of 
spatially coherent structures from IFSAR data
have  been  explored.  The  strategies  are  being 
extended  under  the  APGD  program  for  inclusion 
in  Ascender  II.  In  one  approach,  building 
footprints  extracted  from  optical  data  are  used  as  a 
focus  of  attention  mechanism  to  select  subsets  of 
the  IFSAR  data.  This  is  followed  by  the 
application  of  robust  3D  surface  fitting  techniques 
to  the  elevation  data.  In  a  second  approach, 
variations  on  traditional  region  growing  methods 
are  applied  to  the  IFSAR  data  alone.  Geometric 
constraints  can  be  imposed  during  region  growing 
to  produce  rectangular  or  rectilinear  shapes.  These 
ideas  have  been  explored  using  the  initial 
Sandia/Kirtland  AFB  dataset;  this  dataset  has  since 
been  characterized  as  being  particularly  noisy  with 
a  large  number  of  drop-outs  and  outliers.  This 
effort is described in more detail in (Hoepfner et
al. 1997; Jaynes et al. 1997).
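
As a rough illustration of the second approach, the sketch below grows a region over an elevation grid from a seed cell, accepting 4-connected neighbors whose height is close to the seed height and skipping drop-outs. It is a textbook region grower, not the variant developed for Ascender II: the tolerance, the use of `None` for drop-outs, and the function name are assumptions, and no rectangular or rectilinear shape constraint is applied.

```python
from collections import deque


def grow_region(grid, seed, tol=1.0):
    """Grow a 4-connected region of cells within `tol` meters of the seed height."""
    rows, cols = len(grid), len(grid[0])
    base = grid[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region:
                h = grid[nr][nc]
                # None marks a drop-out (no elevation return) and is never added.
                if h is not None and abs(h - base) <= tol:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


# Hypothetical elevation patch in meters: a small raised structure with one drop-out.
patch = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 10.0, 10.4, 0.0],
    [0.0, 9.8, None, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
roof = grow_region(patch, (1, 1))
```

A shape constraint could be added by, for example, testing each candidate cell against the bounding rectangle of the current region before accepting it.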

1.5  Surface  Microstructure  Extraction 
from  Multiple  Aerial  Images 

Building  surfaces  with  microstructures  provides 
important  information  for  many  military  and 
civilian  applications.  The  extraction  of  small  scale 
information  from  aerial  imagery  is  difficult  due  to 
problems  caused  by  perspective  distortion,  data 
deficiencies,  and  shadows  and  occlusions.  A 
subsystem  has  been  developed  for  improved 
extraction  of  site  model  details,  often  at  a  scale 
close  to  the  limits  of  image  resolution. 

An  Orthographic  Facet  Image  Library  (OFIL) 
system  and  a  generic  window  and  door  extraction 
module have been constructed under the assumption
that  an  initial  site  model  is  available,  and  sufficient 
camera  and  light  source  information  is  known.  The 
OFIL system is designed to systematically collect
the  building  facet  intensities  from  multiple  aerial 
images  into  an  organized  orthographic  library, 
eliminate  the  effects  of  shadows  and  occlusions, 
and  combine  the  intensities  from  different  sources 
to  form  a  complete  and  consistent  intensity 
representation  for  each  facet.  A  'Best  Piece 
Representation'  algorithm  is  designed  to  combine 
intensities  from  multiple  views,  resulting  in  a 
unique  surface  intensity  representation.  The 
window  extraction  module  focuses  attention  on 
wall  facets,  attempting  to  extract  the  2-D  window 
and  door  patterns  attached  to  the  walls.  The 
algorithms  are  typically  useful  in  urban  sites. 
Experiments show successful applications of this
approach  to  site  model  refinement  and  improved 
fly-through  scene  visualizations;  details  can  be 
found  in  (Wang  et  al.  1997). 

2.  Terrest  Terrain  Reconstruction  System 

The  UMass  Terrain  Reconstruction  System 
(Schultz  1994)  deals  effectively  with  highly 
oblique  viewing  conditions  using  a  texture 
correlation algorithm that incorporates (1)
hierarchical unwarping, (2) weighted cross-correlation,
and (3) narrow search subpixel
registration. 

Recent  extensions  of  Terrest  (Schultz,  Stolle  et  al. 
1997)  to  site  modeling  applications  involve  the 
incorporation  of  boundary  constraints  into  the 
correlation  masks.  The  windows  of  a  correlation 
mask  are  restricted  to  be  completely  on  either  side 
of  a  boundary,  thereby  causing  the  mask  to  be 
adaptive  to  the  context  in  which  it  is  applied.  For 
example,  with  building  roofs,  correlation  masks 
are  automatically  shaped  to  lie  entirely  inside  or 
outside  the  area  where  a  polygonal  boundary  has 
been detected (Quam 1984). The net result is
significantly  sharpened  digital  elevation  maps 
(DEMs). 
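
The effect of restricting a correlation mask to one side of a boundary can be illustrated with a weighted normalized cross-correlation in which pixels outside the boundary receive zero weight. This is a generic sketch, not the Terrest code: patches are flattened to lists, the weights are binary, and the sample values are invented.

```python
import math


def weighted_ncc(a, b, w):
    """Weighted normalized cross-correlation of two flattened patches.

    Pixels with weight 0 (for example, those across a roof boundary) are ignored.
    """
    sw = sum(w)
    if sw == 0:
        return 0.0
    ma = sum(wi * ai for wi, ai in zip(w, a)) / sw
    mb = sum(wi * bi for wi, bi in zip(w, b)) / sw
    cov = sum(wi * (ai - ma) * (bi - mb) for wi, ai, bi in zip(w, a, b))
    va = sum(wi * (ai - ma) ** 2 for wi, ai in zip(w, a))
    vb = sum(wi * (bi - mb) ** 2 for wi, bi in zip(w, b))
    den = math.sqrt(va * vb)
    return cov / den if den > 0 else 0.0


# Hypothetical patches: the first three pixels lie on the roof, the rest on the ground.
left = [1, 2, 3, 9, 9, 9]
right = [2, 4, 6, 0, 5, 1]
roof_only = weighted_ncc(left, right, [1, 1, 1, 0, 0, 0])
whole_window = weighted_ncc(left, right, [1, 1, 1, 1, 1, 1])
```

With the mask confined to the roof pixels the two patches correlate perfectly, while the unmasked window is dominated by the mismatched ground pixels.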

2.2  Automated  Bundle  Adjustment 

An  effort  is  underway  to  automatically  select 
image  match  points  as  a  precursor  to  the  bundle 
adjustment  process  necessary  to  precisely  register 
images  and  to  compute  precise  relative  camera 
orientation.  The  approach  we  are  developing  uses 
building  corners  as  the  feature  points  of  choice. 
These  features  are  weighted  according  to  their 
distinguishability, and correlation masks are
generated from features of high confidence. We
are examining the question of whether or not the
correlation  peaks  resulting  from  the  use  of  these 
features  during  matching  can  be  analyzed  in  order 
to  distinguish  between  true  correspondences  and 
false  matches. 

Since  the  degree  of  precision  obtained  in  the 
camera  pose  parameters  depends  strongly  on  the 
accuracy  of  the  match  point  locations,  a  robust 
estimation  technique  is  used  to  remove  outliers, 
i.e.  false  correspondences,  which  passed  the  local 
tests.  The  final  step  is  a  least-squares  technique  to 
obtain  the  final  relative  orientation.  Typically  20  to 
30  distinctive  points  are  required  for  accurate 
results. 
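
The two-stage pattern described above (robust rejection of false correspondences, followed by a least-squares estimate from the survivors) can be shown on a much simpler problem than relative camera orientation. The sketch below estimates only a 2D image shift between matched points; the median-based rejection rule, the `min_scale` floor and all names are assumptions chosen for illustration and are not taken from the system described here.

```python
import math
import statistics


def estimate_shift(matches, k=3.0, min_scale=0.5):
    """Estimate a 2D shift from point matches [((x, y), (x2, y2)), ...].

    Step 1: reject matches whose offset is far from the median offset.
    Step 2: least-squares shift of the inliers, which for a pure
            translation is simply the mean inlier offset.
    """
    offsets = [(q[0] - p[0], q[1] - p[1]) for p, q in matches]
    med_x = statistics.median([dx for dx, _ in offsets])
    med_y = statistics.median([dy for _, dy in offsets])
    resid = [math.hypot(dx - med_x, dy - med_y) for dx, dy in offsets]
    # Robust scale: median residual, floored so exact matches do not give zero.
    scale = max(statistics.median(resid), min_scale)
    inliers = [o for o, r in zip(offsets, resid) if r <= k * scale]
    n = len(inliers)
    shift = (sum(dx for dx, _ in inliers) / n, sum(dy for _, dy in inliers) / n)
    return shift, n


# Hypothetical matches: five consistent with a (2, 1) shift, one false match.
matches = [
    ((0, 0), (2.0, 1.0)), ((10, 0), (12.1, 0.9)), ((0, 10), (1.9, 11.1)),
    ((10, 10), (12.0, 11.0)), ((5, 5), (7.0, 6.0)), ((3, 7), (33.0, -13.0)),
]
shift, used = estimate_shift(matches)
```

The false match is discarded in the first stage, and the final estimate is computed from the remaining five points.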

3.  Algorithms  to  Support  Force  Monitoring 

3.1  Using  Three-Dimensional  Features  to 
Improve  Terrain  Classification 

Image  texture  has  long  been  regarded  as  the  spatial 
distribution  of  gray-level  variation,  and  texture 
analysis has generally been confined to the 2-D
image domain. We have demonstrated the utility
of "3-D world texture" as a function of 3-D
structures (Wang et al. 1997) and proposed a set of
3-D textural features. The proposed 3-D features
have great potential for terrain classification.
Experiments  have  been  carried  out  to  compare  the 
3-D  features  with  a  traditional  2-D  feature  set.  The 
results  show  that  the  3-D  features  significantly 
outperform  the  2-D  features  in  terms  of 
classification  accuracy  and  training  data  reliability. 
The classifications have been used to generate
ground  cover  maps  and  a  skeletal  road  network. 

3.2  Visibility  Analysis  for  Force  Monitoring 

Visibility  analysis  algorithms  have  been  developed 
for  a  variety  of  force  monitoring  scenarios 
including  stealth  path  planning,  placing  a  set  of 
observers  on  an  elevation  map  to  maximize  spatial 
coverage,  and  for  analyzing  when/where  a  force  of 
a  given  size  would  be  detected  over  a  given  line  of 
advance. 

Our theoretical work is based on the Art Gallery
Problem, which is the problem of determining the
number of observers necessary to cover an art
gallery such that every point is seen by at least one
observer. A polynomial time solution has been
developed for the 3-D version of the Art Gallery
Problem. Because the problem is NP-hard, the
solution presented is an approximation, and
bounds to the solution are presented. Our solution
uses techniques from computational geometry,
graph coloring and set coverage. A complexity
analysis for each step and an analysis of the
overall quality of the solution have been derived
(Marengoni  et  al.  1996).  This  general  algorithm 
has  been  applied  to  several  problems  in  visibility 
analysis  on  an  elevation  map. 
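
The set coverage component can be illustrated with the standard greedy approximation: repeatedly pick the observer position that sees the most still-uncovered terrain cells. This is a generic sketch and not the algorithm of (Marengoni et al. 1996); it assumes the visibility sets have already been computed from the elevation map, and the observer names and cell numbers are hypothetical.

```python
def greedy_cover(visible, targets):
    """Greedy set cover.

    visible: {observer: set of cells that observer can see}
    targets: iterable of cells that must be covered
    Returns (chosen observers in pick order, cells left uncovered).
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        best = max(visible, key=lambda o: len(visible[o] & uncovered))
        gain = visible[best] & uncovered
        if not gain:  # remaining cells are not visible from any candidate
            break
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical visibility sets for four candidate observer positions.
visible = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6},
    "D": {1, 6},
}
observers, missed = greedy_cover(visible, range(1, 7))
```

Greedy selection is polynomial in the number of candidates and cells, and its result is known to be within a logarithmic factor of the optimal cover size.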

4.  Content-based  Image  Indexing 

4.1  Center  for  Intelligent  Information 
Retrieval  (CIIR) 

The  Center  for  Intelligent  Information  Retrieval 
(CIIR)  conducts  leading  basic  research  in  the  area 
of  information  systems.  This  national  center  is  one 
of  only  four  centers  in  science  and  engineering  to 
be  funded  in  1992  by  the  National  Science 
Foundation  under  its  State/Industry  University 
Cooperative  Research  Centers  program.  One  of  the 
goals of the CIIR is to develop tools that provide
effective  and  efficient  access  to  large, 
heterogeneous,  distributed,  text  and  multimedia 
databases.  A  new  partnership  between  the 
Computer  Vision  Lab  and  CIIR  is  focused  on 
content-based  multimedia  indexing  and  retrieval,  a 
difficult  yet  vitally  important  task.  The  aim  of 
content-based  retrieval  is  to  efficiently  find  images 
which  contain  the  object  represented  in  a  query 
image  in  a  large  database. 

4.2  Appearance-Based  Indexing  &  Retrieval 

A  system  to  retrieve  images  using  a  syntactic 
description  of  appearance  has  been  developed  and 
appears  in  these  proceedings  (Ravela  and 
Manmatha 1997). A multi-scale invariant vector
representation of images in the database is
obtained by first filtering with Gaussian derivative
filters at several scales and then computing low
order differential invariants; this is done off-line.


Run-time  queries  are  designed  by  the  users  from 
an  example  image  by  selecting  a  set  of  salient 
regions.  The  responses  corresponding  to  these 
regions  are  matched  with  those  of  the  database  and 
a  measure  of  fitness  per  image  in  the  database  is 
computed  in  both  feature  space  and  coordinate 
space.  The  results  are  then  displayed  to  the  user 
sorted  by  the  match  score.  From  experiments 
conducted  with  over  1500  images  it  has  been 
shown  that  images  similar  in  appearance,  and 
whose  viewpoints  are  within  25  degrees  of  the 
query  image,  can  be  effectively  retrieved. 

4.3  Color-Based  Indexing  &  Retrieval 

A  new  multi-phase,  color-based  image  retrieval 
system  (FOCUS)  has  been  developed  which  is 
capable  of  identifying  multi-colored  query  objects 
in  an  image  in  the  presence  of  significant, 
interfering  backgrounds.  The  query  object  may 
occur  in  arbitrary  sizes,  orientations  and  locations 
in  the  database  images.  The  color  features  used  to 
describe  an  image  have  been  developed  based  on 
the  need  for  speed  in  matching  and  ease  of 
computation  on  complex  images  while 
maintaining  the  scale  and  rotation  invariance 
properties.  The  first  phase  matches  the  color 
content  of  an  image  with  the  query  object  colors 
using  an  efficient  indexing  mechanism.  The 
second  phase  matches  the  spatial  relationships 
between  color  regions  in  the  image  with  the  query 
using  a  spatial  proximity  graph  (SPG)  structure 
designed  for  the  purpose.  The  method  is  fast  and 
has  low  storage  overhead.  Test  results  with  multi¬ 
colored  query  objects  from  man-made  and  natural 
domains show that FOCUS is quite effective in
handling  interfering  backgrounds  and  large 
variations  in  scale  (Das  and  Riseman  1997). 
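
The role of an index in the first phase can be pictured with an inverted index from quantized color labels to the images that contain them, so that candidate images are found by set intersection rather than by scanning the database. This is only an illustration of the general idea: the actual FOCUS color features and index structure are not specified here, and the color labels, image names and function names are hypothetical.

```python
from collections import defaultdict


def build_index(image_colors):
    """Map each quantized color label to the set of images containing it."""
    index = defaultdict(set)
    for image_id, colors in image_colors.items():
        for color in colors:
            index[color].add(image_id)
    return index


def candidates(index, query_colors):
    """Images that contain every color of the query object."""
    sets = [index.get(color, set()) for color in query_colors]
    return set.intersection(*sets) if sets else set()


# Hypothetical database of three images with their dominant colors.
database = {
    "img1": {"red", "white", "blue"},
    "img2": {"red", "green"},
    "img3": {"white", "red"},
}
index = build_index(database)
hits = candidates(index, {"red", "white"})
```

Only the images that survive this phase would need the more expensive comparison of spatial relationships between color regions in the second phase.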

4.4  Text  Detection  &  Extraction  in  Images 

There  are  many  applications  in  which  the 
automatic  detection  and  recognition  of  text 
embedded  in  images  is  useful.  These  applications 
include  multimedia  systems,  digital  libraries,  and 
Geographical  Information  Systems.  However,  text 
is  often  printed  against  shaded  or  textured 
backgrounds  or  is  embedded  in  images.  Examples 
include  maps,  advertisements,  photographs, 
videos,  and  stock  certificates.  Current  OCR  and 
other  document  recognition  technology  cannot 
handle  these  situations  well. 

A  four-step  system  has  been  developed  that 
automatically  detects  and  extracts  text  from 
images  by  treating  it  as  a  texture  (Wu  et  al.  1997). 
First, potential text locations are found by filtering
with second-order derivatives of Gaussians at three
different scales. Second, vertical strokes from
horizontally aligned text regions are extracted.
Based on several heuristics, such as height
similarity,  spacing  and  alignment,  strokes  are 
grouped  into  tight  rectangular  bounding  boxes 
around text strings. These steps are then applied to
a  pyramid  of  images  generated  from  the  input 
images  in  order  to  detect  text  over  a  wide  range  of 
font  sizes,  and  then  the  boxes  are  fused  at  the 
original  resolution.  In  a  third  step,  the  background 
is  cleaned  up  and  the  image  is  converted  to  binary. 
Finally,  text  bounding  boxes  are  refined  (repeating 
steps  2  and  3)  by  using  the  extracted  items  as 
strokes.  The  final  output  produces  two  binary 
images  for  each  text  box  which  can  then  be  passed 
to  any  standard  OCR  software. 

The  system  has  been  tested  on  images  from  a  wide 
variety  of  sources,  including  newspapers, 
magazines, photographs, digitized video frames,
etc. Of the 21820 characters and 4406 words in
these test images, 95% of the characters and 93%
of the words have been successfully extracted by
the system. Of these, 14703 characters and 2981
words are believed to be in OCR-readable fonts, and
84%  of  the  characters  and  77%  of  the  words  are 
successfully  recognized  by  a  commercial  OCR 
system. 

5.  Learning  in  Vision 

5.1  New  formulation  of  Control  Policies 
based  on  Markov  Decision  Processes  and 
Reinforcement  Learning 

The  original  paradigm  for  SLS  was  to  use  decision 
trees  (or  other  classifiers)  to  evaluate  intermediate 
data  results  at  each  level  of  representation,  and  to 
use  some  mechanism  to  choose  how  to  transform 
data  from  one  level  of  representation  to  the  next. 
We  developed  a  model  for  applying  reinforcement 
learning  to  object  recognition  in  which  each  level 
of  representation  is  viewed  as  a  continuous  feature 
space  defined  by  its  measurable  attributes,  and 
control  policies  are  learned  using  a  combination  of 
reinforcement  learning  and  neural  networks  that 
map  points  in  the  feature  spaces  onto  optimal 
actions.  An  initial  implementation  of  this  new 
approach was completed in December 1995. In
the first major test of this system, it was 10-for-10
in recognizing rooftops in aerial images of Ft.
Hood, TX. These results (which use a reduced
library of visual procedures) were reported in
(Draper 1996). Prof. Bruce Draper has joined
Colorado  State  University  and  will  continue 
developing  this  work  in  reinforcement  learning  in 
applications  related  to  the  Automatic  Population  of 
Geospatial  Databases  (APGD)  program. 

5.2 Real-time interactive classification

Manual  generation  of  training  examples  for 
supervised  learning  is  an  expensive  process.  One 
way  to  reduce  this  cost  is  to  produce  training 
instances  interactively  that  are  highly  informative. 
The  feasibility  of  such  an  approach  has  been 
demonstrated  on  an  image  pixel  classification  task 
that is the front-end to many higher level reasoning
applications that can make useful inferences about
the  contents  of  the  image.  However,  the 
construction  of  pixel  classifiers  is  a  labor-intensive 
task  involving  user  interaction  to  manually  select 
feature  sets,  manually  select  local  training  data  for 
each  desired  object  class,  and  then  to  provide 
feedback  as  a  result  of  classification  for  additional 
refinement  until  satisfactory  global  classification  is 
achieved. 

Thus,  the  prototype  system  we  have  implemented 
is  an  exploration  into  a  new  classification 
paradigm.  We  have  developed  a  prototype 
interactive  tool  (Piater  and  Utgoff  1997)  that 
allows  the  user  to  immediately  see  the  result  of 
selecting incremental training data so that the user can
adjust further selection on the basis of
inaccurate classification. This system shows that
the incremental classifier converges to satisfactory
performance after a very small number of training
instances and requires only a fraction of the
typical  human  effort  to  provide  them.  This 
suggests  an  interactive  real-time  3D  visualization 
tool  for  incremental  classification  of  terrain  in 
aerial  images.  This  now  allows  interactive  training 
of  the  classifier  with  the  user  examining  the  world 
data  from  more  natural  and  understandable 
viewpoints  that  show  the  sensor  data  in  the  context 
of  its  three-dimensional  characteristics,  e.g.  from  a 
45  degree  downward  oblique  view,  where  sides  of 
objects  and  terrain  are  more  understandable. 
Classification  results  can  then  be  rapidly  overlaid 
onto  the  terrain  model  using  a  variety  of  graphic 
display  techniques,  and  with  incremental  real-time 
updating  of  training  data. 
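
The interaction loop (label one example, see the classification change at once, label the next where the result is wrong) depends on a classifier that can be updated one instance at a time. The sketch below uses an incremental nearest-mean classifier purely as a stand-in; the classifier used in the tool of (Piater and Utgoff 1997) is not described here, and the feature values and class names are hypothetical.

```python
class IncrementalNearestMean:
    """Pixel classifier that keeps a running mean feature vector per class."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def add(self, features, label):
        """Incorporate one labeled training instance without retraining."""
        total = self.sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        self.counts[label] = self.counts.get(label, 0) + 1

    def classify(self, features):
        """Return the class whose mean is closest in squared Euclidean distance."""
        def dist(label):
            n = self.counts[label]
            return sum((v - s / n) ** 2 for v, s in zip(features, self.sums[label]))
        return min(self.counts, key=dist)


# Hypothetical RGB-like features in [0, 1]; the user clicks two pixels.
clf = IncrementalNearestMean()
clf.add([0.2, 0.8, 0.1], "grass")
clf.add([0.5, 0.5, 0.5], "road")
label = clf.classify([0.25, 0.7, 0.15])
```

Because each `add` call costs time proportional to the feature length only, the display can be refreshed after every click.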

6.  Miscellaneous  Related  Research 

6.1  Persistent  Data  Management  for  Visual 
Applications 

Visual  applications  need  to  represent,  manipulate, 
store,  and  retrieve  both  raw  and  processed  visual 
data.  Existing  relational  and  object-oriented 
database  systems  fail  to  offer  satisfactory  visual 
data  management  support  because  they  lack  the 
kinds  of  representations,  storage  structures, 
indices,  access  methods,  and  query  mechanisms 
needed  for  visual  data.  We  have  previously  argued 
that  extensible  visual  object  stores  offer  feasible 
and  effective  means  to  address  the  data 
management  needs  of  visual  applications  (Draper 
1993;  Draper).  Such  a  visual  object  store  is  under 
development  at  the  University  of  Massachusetts 
for  the  management  of  persistent  visual 
information.  ISR4  is  designed  to  offer  extensive 
storage  and  retrieval  support  for  large,  complex 
sets of visual data, customizable buffering and
clustering,  and  spatial  and  temporal  indexing, 
along  with  a  variety  of  multi-dimensional  access 
methods  and  query  languages. 
6.2 Segmentation of Stroke Lesions in MRI

A  collaborative  exploratory  project  with  Baystate 
Medical  Center  is  underway  for  analysis  of  stroke 
lesions in brain scans (Piater 1996). The goal
of  this  clinical  study  is  the  volumetric  analysis  of 
damaged  cells  for  people  who  have  suffered  an 
acute  ischemic  stroke,  and  their  response  over  time 
to  various  forms  of  treatment  involving  the 
lowering  of  blood  pressure  in  the  period 
immediately  following  the  stroke.  This  requires 
the  segmentation  of  brain  lesions  where  there  is 
generally  a  core  of  dead  tissue  (infarct)  and  a 
surrounding  area  of  damaged  tissue  that  either 
might  recover  or  die  (penumbra).  The  change  in 
the  size  of  the  lesion  over  a  varying  period  of  time 
(several  days,  weeks,  and/or  months)  will  be 
correlated  with  qualitative  assessment  of  patient 
functionality and the different forms of treatment.

6.3  Weighted  Bipartite  Matching  for  3D 
Correspondence  and  Rigid  3-D  Motion 

A  closed  form  solution  has  been  developed  for  the 
problem  of  determining  correspondences  between 
two  sets  of  3D  points  for  which  the  number  of 
points  in  the  sets  is  not  the  same  (Cheng,  Wu  et  al. 
1996).  This  is  the  general  3D  rigid  motion  problem 
and  the  solution  is  based  on  a  decomposition  of  the 
correlation  matrix  eigenstructure.  Using  a  heuristic 
measure  of  point  pair  affinity  derived  from  the 
eigenstructure,  a  weighted  bipartite  matching 
algorithm  has  been  developed  to  determine  the 
correspondences  in  general  cases  where  missing 
points  occur.  The  use  of  the  affinity  heuristic  also 
leads  to  a  fast  outlier  removal  algorithm,  which 
can  be  run  iteratively  to  refine  the  correspondence 
recovery. 
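
The matching step can be illustrated with a small sketch. It assumes the point-pair affinities have already been computed and are supplied as a matrix; the eigenstructure-based affinity itself is not reproduced here. The function name `max_weight_matching` is hypothetical, and the exhaustive search is a stand-in suitable only for very small point sets, not the algorithm of the cited work.

```python
from itertools import permutations


def max_weight_matching(affinity):
    """Exact maximum-weight matching for a small rectangular affinity matrix.

    affinity[i][j] is an assumed, precomputed affinity between point i of
    the first set and point j of the second set. The sets may differ in
    size; points of the larger set may stay unmatched (the missing-point
    case). Returns a sorted list of (i, j) index pairs.
    """
    n_rows, n_cols = len(affinity), len(affinity[0])
    if n_rows > n_cols:
        # Solve the transposed problem, then swap indices back.
        transposed = [list(col) for col in zip(*affinity)]
        return sorted((r, c) for c, r in max_weight_matching(transposed))

    best_score, best_cols = float("-inf"), None
    # Try every way of assigning distinct columns to the rows.
    for cols in permutations(range(n_cols), n_rows):
        score = sum(affinity[r][c] for r, c in enumerate(cols))
        if score > best_score:
            best_score, best_cols = score, cols
    return list(enumerate(best_cols))


# Two points in the first set, three in the second: one point is unmatched.
pairs = max_weight_matching([[0.9, 0.1, 0.2],
                             [0.2, 0.8, 0.3]])
```

For realistic set sizes the exhaustive search would be replaced by a polynomial-time assignment method such as the Hungarian algorithm.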

Abstract

The  construction  of  large-scale  geospatial  databases 
remains  an  expensive  and  time  consuming  task.  We 
describe  an  approach  for  automatically  extracting 
certain  types  of  linear  features  with  a  horizontal  ex¬ 
tent  significantly  less  than  the  resolution  of  the  base- 
level  terrain  data  covering  the  area  in  which  these 
structures occur. The first phase of this effort considers
two classes of such features: ravines and road
cuts  and  fills.  Accurate  detection  and  localization  of 
these  features  is  difficult  in  even  high-resolution  el¬ 
evation  data.  While  they  are  often  apparent  in  aerial 
imagery,  they  are  easily  missed  or  confused  with 
other  features,  making  reliable  detection  based  on 
imagery  alone  problematic.  The  key  to  solving  this 
problem  is  to  utilize  techniques  which  combine  pho- 
togrammetric  analysis  of  the  terrain  with  focused 
image  understanding  methods  applied  to  individual 
images. 

1  Overview 

Sensor  technology,  limitations  of  photogramme- 
try,  storage  constraints,  and  requirements  for  real¬ 
time  rendering  and  analysis  all  limit  the  fidelity 
with  which  terrain  can  effectively  be  represented 
in  a  geospatial  database.  For  certain  applications, 
it  is  critical  that  these  databases  include  specific 
micro-terrain  features  with  a  lateral  extent  less  than 
the  resolution  of  base-level  terrain  description.  In 
current  generation  terrain  modeling  systems,  topo¬ 
graphic  micro-terrain  is  seldom  present.  Man-made 
micro-terrain,  particularly  that  associated  with  road 
cuts, is sometimes included but often does not correctly
correspond to the actual terrain.

This work was supported by the Defense Advanced Research
Projects Agency, contract number pending.

This project demonstrates how feature extraction
methods which combine image understanding with
a  terrain  analysis  based  on  other  sources  of  infor¬ 
mation  can  be  used  to  reliably  locate  micro-terrain 
features  in  a  way  that  improves  the  utility  of  terrain 
models  used  for  simulation  and  synthetic  environ¬ 
ment  applications.  Our  initial  focus  is  on  two  types 
of  embankment  features  which  are  of  particular  rel¬ 
evance  to  DOD: 

•  Extraction  of  natural  ravines. 

•  Extraction  of  road  cuts  and  fills. 

Embankments  are  terrain  features  that  critically  af¬ 
fect  the  realism  of  ground  warfare  simulations. 
They  provide  concealment  while  also  constituting 
potential barriers to traversability. Great difficulty is
associated with accurately extracting embankments
from source data. Embankments are long, narrow
features with a width typically much smaller than
the  spatial  resolution  of  the  terrain  data  used  to  con¬ 
struct  geospatial  databases.  While  they  are  often  ap¬ 
parent  in  high  resolution  aerial  imagery,  reliable  de¬ 
tection  is  difficult  and  the  visual  signature  of  an  em¬ 
bankment  is  easily  confused  with  other  commonly 
occurring  terrain  and  cultural  features.  As  a  re¬ 
sult,  enhancing  the  realism  of  such  features  in  ter¬ 
rain  databases  currently  requires  substantial  manual 
processing. 

We  are  addressing  these  problems  with  an  approach 
aimed  at  achieving  the  following  objectives: 

•  Ability  to  accommodate  a  variety  of  terrain 
types  and  covers. 

• Tolerance of source data of variable quality and
uncertain  pedigree. 

•  Ability  to  exploit  existing  tools. 

•  Easy  insertion  of  results  into  existing  tools. 

Our  interest  lies  in  cartographic  features  that  are 
too  small  to  be  extracted  using  stereo  photogram- 
metry  and  too  indistinct  to  be  found  reliably  using 
standard  image  understanding  methods  alone.  Our 
approach  combines  existing  methods  for  extract¬ 
ing  terrain  data  and  features  with  standard  image 
processing  algorithms  in  order  to  extract  informa¬ 
tion  not  available  from  either  source  alone.  While 
micro-features  are  absent  from  terrain  models  pro¬ 
duced  using  conventional  photogrammetric  means, 
the  presence  of  such  features  can  often  be  inferred 
from  even  low  resolution  elevation  maps. 

Ravines  are  erosional  features  generated  by  large- 
scale  processes,  even  if  the  final  effect  is  visible 
mostly  on  a  fine-scale.  As  a  result,  hydrological 
analysis  can  be  used  to  predict  where  such  erosion 
is  most  likely  to  occur.  Effective  algorithms  exist 
for  performing  this  analysis  even  on  low-resolution, 
error-laden representations of the terrain skin.
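
As an illustration of the kind of hydrological analysis referred to here, the following is a minimal sketch of flow accumulation using the common D8 routing scheme. D8 is an assumption chosen for illustration; the text does not say which flow algorithm is used. The names `d8_flow_accumulation`, `channel_cells`, and the `threshold` parameter are hypothetical.

```python
def d8_flow_accumulation(dem):
    """Count, for each cell of a rectangular elevation grid, how many cells
    drain through it (itself included) under D8 steepest-descent routing."""
    rows, cols = len(dem), len(dem[0])
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]

    def downslope(r, c):
        # Pick the neighbor with the steepest positive drop per unit distance.
        best, best_drop = None, 0.0
        for dr, dc in neighbors:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                distance = (dr * dr + dc * dc) ** 0.5
                drop = (dem[r][c] - dem[rr][cc]) / distance
                if drop > best_drop:
                    best, best_drop = (rr, cc), drop
        return best  # None for pits and flat cells

    accumulation = [[1] * cols for _ in range(rows)]
    # Visit cells from highest to lowest, so each cell's total is final
    # before it is passed on to its strictly lower receiver.
    order = sorted(((dem[r][c], r, c) for r in range(rows)
                    for c in range(cols)), reverse=True)
    for _, r, c in order:
        target = downslope(r, c)
        if target is not None:
            accumulation[target[0]][target[1]] += accumulation[r][c]
    return accumulation


def channel_cells(accumulation, threshold):
    """Cells whose upstream area reaches the threshold: candidate ravine sites."""
    return [(r, c) for r, row in enumerate(accumulation)
            for c, value in enumerate(row) if value >= threshold]


# A small V-shaped valley that slopes toward the bottom row.
dem = [[5, 4, 3, 4, 5],
       [4, 3, 2, 3, 4],
       [3, 2, 1, 2, 3],
       [2, 1, 0, 1, 2]]
acc = d8_flow_accumulation(dem)
candidates = channel_cells(acc, threshold=9)
```

Cells with high accumulation mark where concentrated flow, and hence erosion, is most plausible. In the approach described here such cells would serve only as predictions to be confirmed in the imagery.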

Except  for  unmaintained  tracks,  road  construction 
involves  local  terrain  modifications.  When  a  road 
crosses  a  slope  transversely,  the  side-to-side  cross- 
section  of  the  road  must  be  maintained  at  or  near 
the  horizontal.  For  roads  going  up  or  down  a  steep 
slope, switchbacks and/or road cuts and fills are often
introduced to improve trafficability. 2-D information
about road position, obtained from available
sources,  can  be  combined  with  a  3-D  analysis  of  the 
local  terrain  to  produce  predictions  about  where  cuts 
and  fills  would  have  been  utilized  in  the  road  build¬ 
ing  process. 

As  shown  in  Figure  1,  our  approach  uses  predic¬ 
tions  about  the  possible  existence  and  approximate 
locations  of  ravine  and  road  cut  features  to  drive  a 
top-down,  model-guided  image  understanding  pro¬ 
cess.  The  image  understanding  component  con¬ 
firms  the  hypothesized  presence  of  terrain  features 
and refines positional estimates, rather than performing
a  bottom-up  extraction  of  the  features  themselves. 
The  restricted  nature  of  this  task  allows  the  use  of 
simple  and  reliable  image  analysis  methods  that  are 
tolerant  of  wide  variability  in  input  data. 

Figure 1: Combining terrain analysis and image
understanding for micro-terrain feature
extraction.

While the initial scope of this project is limited to a
set of feature-specific methods, the features themselves
are often critical to the utility of the entire
geospatial  database,  particularly  when  used  in  sup¬ 
port  of  Synthetic  Environment  (SE)  applications. 
Existing  tools  are  not  able  to  extract  these  features 
without  major  human  effort.  The  solutions  outlined 
here  depend  on  the  use  of  “context”  to  direct  image 
understanding  methods  and  to  integrate  the  results 
of  that  analysis  into  a  consistent  terrain  database. 
Our  hope  is  that  the  approach  will  generalize  to  ad¬ 
ditional  classes  of  features  for  which  the  ways  in 
which  the  features  interact  with  the  surrounding  ter¬ 
rain  provide  enough  information  to  guide  targeted 
IU analyses.

2  Micro-Features  in  the  Context  of 
Modeling  and  Simulation  (M&S) 
Applications. 

Real-time  modeling  and  simulation  applications  are 
increasingly  emphasizing  realism  in  both  appear¬ 
ance  and  behavior  (Figure  2).  The  current  state- 
of-the-art  in  terms  of  fielded  systems  is  typified  by 
the  Army’s  Close  Combat  Tactical  Trainer  (CCTT) 
[Pope  et  al,  1995].  While  based  on  real  source  data, 
CCTT  is  a  geotypical  simulation,  appropriate  for 
generic  training  but  not  for  gaining  experience  with 
a  specific  geographic  area.  As  image  generators  and 
other  computational  engines  involved  in  implement¬ 
ing  a  simulation  become  more  powerful,  it  becomes 
increasingly  important  to  be  able  to  produce  richer 
and  more  accurate  models  of  the  relevant  3-D  ter¬ 
rain  structure  and  to  update  these  models  as  changes 
occur  in  the  world  they  are  representing.  Thus,  ex¬ 
traction  and  representation  of  veridical  features  is 
increasingly essential.

Figure 2: Trends in real-time simulation.

One  of  the  ways  that  next  generation  image  gen¬ 
erators  such  as  the  Evans  &  Sutherland  Harmony 
system  achieve  improved  visual  realism  is  by  drap¬ 
ing  actual  aerial  imagery  over  the  terrain  skin,  rather 
than  using  geotypical  texture  mapping.  Imagery  is 
becoming  part  of  the  primary  source  data,  rather 
than  serving  solely  as  a  secondary  data  source  on 
which  photogrammetry  and  feature  extraction  is 
performed.  While  the  visual  impact  can  be  dra¬ 
matic,  it  is  critical  that  features  in  the  world  model 
such  as  buildings,  roads,  and  critical  topography  be 
represented  and  localized  in  a  manner  consistent 
with  the  draped  imagery  used  for  rendering.  Im¬ 
age  Understanding  methods  are  thus  likely  to  play 
a  much  more  central  role  in  future  terrain  database 
creation  activities. 

It  is  important  that  new  approaches  to  the  construc¬ 
tion  of  geospatial  databases  provide  improvements 
in  economy  —  measured  in  cost  and  time  —  as  well 
as in the quality of the resulting model. High-resolution
IFSAR and high-resolution, highly calibrated
photogrammetric stereo are expensive to collect
and process and are not likely to be available
prior  to  need  for  many  geographic  areas  of  poten¬ 
tial  interest  to  the  M&S  community.  In  time-critical 
situations,  the  use  of  terrain  feature  extraction  meth¬ 
ods  which  depend  on  such  data  may  force  signif¬ 
icant  delays  in  the  model-building  process  as  the 
needed  data  is  acquired.  As  a  result,  it  is  important 
to  understand  the  minimum  necessary  source  data 
quality  required  to  determine  relevant  information 
and  to  build  data  extraction  tools  able  to  compen¬ 
sate  for  deficiencies  in  existing  source  data. 


Figure  3:  Shallow  wash  located  within  Range  400. 

3  Ravine  Extraction 

While  ravines  and  dry  washes  are  usually  at  least 
partially  visible  on  aerial  photographs,  accurate 
detection  and  localization  is  difficult.  In  addi¬ 
tion,  ravines  are  easily  confused  with  roads,  tracks, 
and  other  structures  commonly  appearing  in  non- 
urban  environments.  Photogrammetry  fails  to  ex¬ 
tract  many  ravine  features  because  of  their  restricted 
depth  and  small  width  relative  to  the  resolution  with 
which  terrain  elevation  is  extracted.  In  addition, 
the  photogrammetric  overlaps  usually  used  for  non- 
urban  terrain  preclude  the  ability  to  see  into  many 
ravine  bottoms  or  measure  the  slopes  of  their  sides. 

The  dry  wash  shown  in  Figure  3  is  located  in 
the  live  fire  range  (Range  400)  of  the  USMC 
Air  Ground  Combat  Center,  located  at  Twentynine 
Palms, CA. The wash is 1m-2m deep and 2m-3m
across. Though very small compared to the resolution
at which terrain features are usually modeled,
such ravines are of critical tactical significance
to dismounted infantry, and therefore to all units in a
combat force.

The  nominal  resolution  of  elevation  data  on  which 
terrain  models  are  based  is  commonly  on  the  order 
of  30m  or  greater.  Due  to  the  smoothing  inherent 
in  the  manner  in  which  the  elevation  data  is  ob¬ 
tained,  the  effective  resolution,  measured  in  terms 
of  the  size  of  distinct  features  apparent  in  the  data, 
is  much  coarser.  Thus,  even  with  DEM  data  finer 
than  a  30m  grid,  terrain  structure  such  as  shown  in 
Figure  3  is  likely  to  go  unrepresented.  Neverthe¬ 
less,  coarse  resolution  DEMs  can  be  used  to  predict 
likely  locations  where  smaller  terrain  deformations 
are  to  be  expected. 

Hydrological  analysis  based  on  digital  elevation 
models is now a standard function in many geographic
information systems (GIS). When combined
with appropriate resampling and interpolation
methods,  such  operations  can  be  used  to  effectively 
estimate the existence and location of ravine features
using DEM data with a post spacing significantly
greater than the width of the features of interest.
Figure 4 shows the results of applying such
a hydrological analysis to a 30m DEM of a portion
of Range 400, with the results overlayed on top of a
higher-resolution aerial image of the same area.

Figure 4: Hydrologic flow analysis based on 30m
DEM of Range 400, overlayed on orthoimage.

One of the advantages of using Range 400 as a
demonstration area is that high quality DEM data
is available with a post spacing of 1m and a nominal
precision of 0.1m. For example, Figure 4
can be compared with Figure 5, which shows a
shaded relief rendering of the same area, based
on the higher resolution DEM. This provides the
ability to compare the results of augmenting standard
resolution terrain data with image understanding
techniques with results based on high precision
(and expensive) photogrammetry (e.g., [Richbourg
et al, 1995, Richbourg and Olson, 1996,
Henderson et al., 1997]).

Figure 5: Shaded relief of Range 400 based on
high-resolution DEM.

Figure 6 illustrates our approach. On the upper right
of the figure is a section of the Range 400 orthoimage
corresponding to a 120m by 150m area on the
ground. In the upper left is the output of a Canny
edge detector applied to this image. The lower right
shows the same area, displayed as a shaded relief
rendering of the high resolution DEM data. Essentially,
this indicates the “ground truth” topography.
The lower left indicates the results of applying automated
ravine estimation to a 30m resolution DEM
covering the full Range 400 area. The center of
the figure illustrates how image and terrain analysis
can be combined for ravine extraction (see [Thoenen
and Thompson, 1997]).

Figure 6: Combining terrain analysis and IU for
ravine extraction. (Panel labels: overhead view;
extracted ravine location; terrain analysis from
30m DEM; "ground truth" from 1m DEM.)


4  Road  Cut  and  Fill  Extraction 

For  a  simulation  to  accurately  reflect  tactical  real¬ 
ity,  it  is  important  to  be  able  to  reliably  determine 
whether  or  not  road  cuts  and  fills  need  to  be  added 
to  road  segments  in  a  geospatial  database.  Fig¬ 
ure  7  shows  a  view  generated  from  the  U.S.  Army 
Close  Combat  Tactical  Trainer  (CCTT)  Central  U.S. 
database  (CCTT  Primary  One).  The  road  cut,  which 
appears  as  a  long  trench  cut  through  a  hillside,  was 
generated  by  an  automatic  cut  and  fill  insertion  tool 
that  first  drapes  the  road  onto  the  terrain  skin  and 
then  inserts  cuts  and  fills  whenever  needed  to  keep 
the  road  grade  from  exceeding  a  preset  steepness. 
It is one of a large number of road cuts and fills in
CCTT Primary One that are obviously incorrect to
anyone viewing the database. While improved cut
and fill insertion techniques might reduce the incidence
of implausible configurations such as shown
here, there is no way to accurately represent what is
actually on the ground without revisiting the source
data.

Figure 7: Incorrect road cut appearing in CCTT
Primary One due to incorrect automatic
insertion tools.

In  most  geospatial  database  construction  systems, 
information  about  the  terrain  skin  and  the  location 
of  roads  is  independently  determined  and  then  rep¬ 
resented  in  separate  layers  in  a  GIS.  A  merging 
operation  must  then  be  performed  to  create  a  ter¬ 
rain  model  that  is  realistic  and  behaves  in  a  manner 
consistent  with  the  desired  semantics  of  the  simula¬ 
tion.  Road  networks  are  almost  always  initially  ex¬ 
tracted  as  2-D  features,  with  no  information  explic¬ 
itly  available  about  superelevation  or  grade.  Three 
general  approaches  are  possible  for  adding  this  in¬ 
formation  to  produce  a  full  3-D  representation  of  the 
road  surface  (Figure  8):  The  road  can  be  “draped” 
over  the  terrain,  adapting  to  the  terrain  surface;  the 
terrain  can  be  locally  deformed  so  as  to  make  the 
road  surface  plausible  and  to  blend  the  road  and  ter¬ 
rain  in  a  natural  manner;  or  explicit  cuts  and  fills  can 
be  introduced  to  deal  with  discrepancies  between 
the  desired  geometry  of  the  road  surface  and  the 
shape  of  the  underlying  terrain. 

Figure 8: Road cut insertion requires merging 2-D
road locations with the 3-D terrain skin
by (a) introducing an explicit road cut,
(b) modifying the terrain skin to accommodate
the road, or (c) draping the road
over the terrain.

As simulations move from geotypical to geospecific
databases, it becomes increasingly important to determine
not just where road cuts and fills are plausible,
but where they actually occur. Our proposed
method for detecting road cuts and fills will proceed
as follows:

•  Search  in  the  imagery  for  roads  present  in  the 
road  corridor  layer  of  the  source  data  GIS.  This 
step  is  similar  to  the  image  analysis  operations 
used  for  ravine  extraction. 

•  Use  civil  engineering  principles  applied  to  road 
corridor  data  draped  over  the  base-level  terrain 
skin  to  predict  where  cuts  and  fills  are  likely  to 
occur. 

• Search along the sides of roads located in
the  imagery,  looking  for  textural  variations  that 
correspond  to  the  pattern  of  predicted  cuts  and 
fills.  The  key  is  to  search  for  patterns  of  vari¬ 
ability,  since  the  actual  visual  appearance  of 
cuts  and  fills  is  impossible  to  predict. 
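
The prediction step in the second item can be sketched as follows. This is a minimal illustration, assuming terrain elevations sampled at a fixed spacing along the road centerline and a single design rule, a maximum allowable grade; real road design applies further rules. The names `grade_limited_profile`, `label_cuts_and_fills`, `max_grade`, and `tolerance` are hypothetical.

```python
def grade_limited_profile(terrain, spacing, max_grade):
    """Follow the draped terrain profile, but never let the road rise or
    fall by more than max_grade * spacing between consecutive stations."""
    max_step = max_grade * spacing
    road = [float(terrain[0])]
    for ground in terrain[1:]:
        low, high = road[-1] - max_step, road[-1] + max_step
        road.append(min(max(float(ground), low), high))
    return road


def label_cuts_and_fills(terrain, road, tolerance):
    """Label each station by comparing terrain height with road height."""
    labels = []
    for ground, surface in zip(terrain, road):
        offset = ground - surface
        if offset > tolerance:
            labels.append("cut")    # terrain stands above the road surface
        elif offset < -tolerance:
            labels.append("fill")   # terrain lies below the road surface
        else:
            labels.append("grade")  # road can follow the terrain directly
    return labels


# Elevations (m) every 10 m along a road crossing a knoll and a gully.
terrain = [100, 100, 100, 106, 100, 100, 94, 100]
road = grade_limited_profile(terrain, spacing=10.0, max_grade=0.1)
labels = label_cuts_and_fills(terrain, road, tolerance=0.5)
```

The resulting labels are predictions only. In the method outlined above they would indicate where to search the imagery for the corresponding pattern of textural variation.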

5  Integration  With  Existing  Tools 

To  be  cost  effective,  methods  developed  as  part  of 
this  project  must  be  easily  integrated  with  other  as¬ 
pects  of  the  geospatial  database  creation  process. 
Micro-feature  extraction  methods  should  be  able  to 
utilize  the  full  range  of  existing  geospatial  database 
tools.  The  results  of  these  micro-feature  extraction 
methods  should  be  easily  used  by  the  full  range  of 
existing  geospatial  database  tools. 

Figure 9: Integrating micro-feature extraction into
established procedures for building terrain
models.

Geospatial  database  tools  use  a  much  broader  range 
of  data  formats  than  do  most  “pure”  image  under¬ 
standing  systems.  Imagery,  elevation  data  and  other 
specifications  of  terrain  skin,  linear  features,  and 
3-D  structures  must  all  be  represented  in  geospe¬ 
cific  coordinate  systems.  In  the  initial  phase  of  this 
project,  all  code  we  develop  will  be  interfaced  to 
the  Arc/Info™  GIS  software.  While  Arc/Info  is  a 
proprietary  system,  it  is  likely  to  be  used  by  anyone 
building  production-grade  geospatial  databases. 

Figure  9  shows  how  programs  we  develop  fit  into 
the  larger  context  of  geospatial  database  production. 
Modeling  tools  in  common  usage  take  in  source 
data;  organize,  analyze,  and  process  it  with  the  help 
of tools such as Arc/Info™ and GDE's Socet Set;
and create models using a language appropriate to
the simulation or mission planning systems required
by  the  end  user.  By  coupling  our  system  to  Arc/Info, 
we  facilitate  compatibility  with  any  model  genera¬ 
tion  system  that  uses  Arc/Info  as  a  support  tool. 

6  Evaluation 

Too  often,  image  understanding  methods  are  eval¬ 
uated  in  isolation  instead  of  in  the  context  of  the 
larger  systems  that  they  are  typically  embedded 
within. The IU methods described in this paper
should be viewed as adding value to existing
geospatial database construction processes. For
evaluation  purposes,  we  will  start  with  state-of-the- 
art  simulation  databases,  which  often  have  been 
constructed  at  substantial  expense.  These  databases 
will  have  known  deficiencies  which  affect  critical 
aspects  of  the  realism  of  simulations.  The  end  re¬ 
sult  of  the  methods  we  have  described  above  will  be 
to  improve  these  existing  terrain  databases.  Thus, 
the appropriate evaluation process involves comparing
the databases before and after the processing we
propose.  This  allows  for  an  operational  assessment 
by  letting  end  users  interact  with  simulations  involv¬ 
ing  the  original  and  improved  terrain  databases. 

Given  believable  ground  truth  data,  quantitative 
measures  can  also  be  developed.  In  the  case  of 
Range 400, the availability of 1m DEM data will aid
this process. Quantitative evaluation will be based
on how well IU methods combined with lower-resolution
DEMs can approximate the ravine features
apparent in the higher-resolution DEMs. Richbourg's
work in developing computational techniques
for analyzing concealment features in the
Range  400  data  provides  a  valuable  starting  point 
[Richbourg  et  al,  1995,  Richbourg  and  Olson, 
1996].  The  NIST-TEC-LADS  project  investigating 
landform  extraction  from  the  Range  400  data  is  also 
relevant. 

Abstract

Colorado  State  University  is  initiating  a  new 
project  on  learning  control  strategies  for  object 
recognition.  It  is  our  belief  that  the  current 
library of IU algorithms is sufficient for solving
many practical tasks if we can only learn
to sequence them properly. We are investigating
the use of open-loop and closed-loop control
policies for sequencing IU algorithms, emphasizing
the use of Markov decision models and
reinforcement learning to derive closed-loop object
recognition policies. This work is being conducted
in the context of the Automatic Population
of Geospatial Databases (APGD) project,
where it will be used to learn object recognition
strategies for finding buildings, roads and other
objects  of  interest  in  aerial  images. 

1  Introduction 

Image understanding (IU) research at Colorado
State University (CSU) is based on two simple
premises:

• IU researchers have made tremendous
progress in developing algorithms for many
aspects of the computer vision problem,
including feature extraction, shape reconstruction
and model matching.

• Despite this progress, there are very few
practical IU systems because they are too
difficult to build and too brittle with respect
to changes in the domain or task
statement.

This work was sponsored by the Defense Advanced
Research Projects Agency (DARPA) Image Understanding
Program under contract 96-14-201 monitored by the
Army Topographic Engineering Laboratory (TEC), and
contracts DAAH04-93-G-422 and DAAH04-95-1-0447,
monitored by the U. S. Army Research Office.

Based  on  these  observations,  researchers  at  CSU 
are  developing  recognition  strategies  for  the 
DARPA  Automatic  Population  of  Geospatial 
Databases  (APGD)  program  based  on  a  new 
approach. Instead of hand developing new algorithms
for specific IU tasks, we approach IU as
a control problem. Our hypothesis is that many
practical IU problems can be solved through
existing techniques, provided we can learn to
select the proper sequence of algorithms. Our
goal is to develop the technology for learning
to recognize objects from examples by training
control strategies that select sequences of IU algorithms
based on the object to be recognized
and  the  domain  context. 

2  Background 

Over  the  last  twenty  years,  computer  vision 
researchers  have  divided  the  field  into  ten  or 
twenty  (or  more)  topics,  each  with  a  narrowly- 
defined  problem  focus.  Within  these  subfields, 
theories  have  been  developed  and  specific  algo¬ 
rithms  have  been  proposed.  As  a  result,  there 
are now good and improving algorithms for feature
extraction (including points, edges, lines,
curves and regions), stereo analysis, multi-frame
feature tracking, depth from motion (two-frame
and multi-frame), shape matching, color matching,
pixel matching (a.k.a. appearance matching),
and 3D pose determination, and new algorithms
are being developed as this is written.

At  the  same  time,  other  researchers  (particu¬ 
larly  within  the  DARPA  community)  have  con¬ 
centrated  on  building  end-to-end  systems  that 
solve specific IU tasks. For example, in recent
years  competing  systems  have  been  developed 
for  recognizing  and  reconstructing  buildings 
from  aerial  images  [McGlone  &  Shufelt,  1993; 
Lin  et  al,  1994;  Jaynes  et  al,  1996],  while  other 
systems  have  been  built  for  recognizing  mili¬ 
tary  targets,  road  networks  and  terrain  features. 
These  systems  can  be  viewed  as  the  intellectual 
descendants of earlier knowledge-based systems
that  exploited  knowledge  about  objects  and  do¬ 
mains  to  create  special-purpose  object  recog¬ 
nition  strategies  (e.g.  [McKeown  et  al,  1985; 
Hwang  et  al,  1986;  Draper  et  al,  1989]). 

If  we  take  a  close  look  at  the  recent  work 
on  building  reconstruction  [McGlone  &  Shufelt, 
1993;  Lin  et  al,  1994;  Jaynes  et  al,  1996],  we 
can draw several conclusions:

•  These  systems  were  built  by  sequencing 
standard IU algorithms for line extraction,
line  grouping,  shadow  analysis  and  graph 
matching/traversal.  (Some  of  these  sys¬ 
tems  refined  previous  algorithms  in  these 
areas.) 

•  These  systems  are  special-purpose  to  the 
extent  that  they  reconstruct  a  single  type 
of  object  within  a  specific  domain.  They 
will  not  work  if  the  target  object  class  or 
domain  is  changed  significantly. 

•  It  is  difficult  to  analyze  these  systems. 
There  is  no  underlying  theory  by  which  to 
gauge  their  performance,  nor  is  there  an 
analytical  method  for  predicting  at  what 
point  they  will  break.  It  is  difficult  even 
to  compare  them,  since  each  system  makes 
slightly  different  assumptions  about  the  im¬ 
agery. 

•  These  systems  are  difficult  to  build.  Each 
system  took  months  or  even  years  of  highly 
skilled  labor  to  construct,  and  even  so  most 
of  these  systems  could  be  improved  given 
more  time. 

Our project is an effort to generalize from earlier
object recognition efforts. Like these previous
systems, our goal is to sequence existing
IU algorithms in order to recognize specific objects
within limited contexts. Unlike previous
efforts, we will neither construct these algorithm
sequences by hand nor build a “knowledge base”
with rules for selecting algorithms (such knowledge
bases have proven error-prone and difficult
to build in the past; see [Draper & Hanson,
1991]). Instead, we will model IU as a control
problem, in which the goal is to select the best
sequence of algorithms for recognizing a given
object class. More specifically, we model the IU
control  problem  as  a  Markov  decision  process, 
in  which  the  goal  is  to  train  a  control  policy 
that  selects  algorithms  so  as  to  maximize  the 
expected  reward  over  time,  where  the  reward 
is  a  weighted  function  of  cost  versus  accuracy. 
Our  aim  is  a  system  that  is  general-purpose  and 
robust  in  the  sense  that  it  can  be  retrained  to 
recognize  a  wide  variety  of  objects  in  various  do¬ 
mains,  and  theoretically  sound  in  the  sense  that 
it  will  converge  to  the  optimal  control  policy  as 
the  training  set  size  increases. 
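
To make the formulation concrete, the following toy sketch treats intermediate data as states and algorithms as actions, with a reward that trades accuracy against cost. Every state name, algorithm name, cost, and probability in it is invented for illustration. It also solves a known model by value iteration, as a simple stand-in for the reinforcement learning proposed here, which must estimate such quantities from training samples.

```python
# TRANSITIONS[state][action] = (cost, [(probability, next_state), ...])
# All entries are hypothetical numbers chosen only to illustrate the idea.
TRANSITIONS = {
    "image": {
        "fast_edges": (1.0, [(1.0, "coarse_lines")]),
        "careful_edges": (3.0, [(1.0, "clean_lines")]),
    },
    "coarse_lines": {
        "match_model": (2.0, [(0.6, "correct"), (0.4, "wrong")]),
    },
    "clean_lines": {
        "match_model": (2.0, [(0.9, "correct"), (0.1, "wrong")]),
    },
}
TERMINAL_ACCURACY = {"correct": 1.0, "wrong": 0.0}


def solve(transitions, terminal_accuracy, accuracy_weight, cost_weight,
          sweeps=50):
    """Value iteration: choose, in every state, the algorithm that maximizes
    expected reward, where reward = accuracy_weight * accuracy at the end
    minus cost_weight * cost of each algorithm that is run."""
    values = {state: 0.0 for state in transitions}
    values.update({state: accuracy_weight * accuracy
                   for state, accuracy in terminal_accuracy.items()})
    policy = {}
    for _ in range(sweeps):
        for state, actions in transitions.items():
            best_action, best_value = None, float("-inf")
            for action, (cost, outcomes) in actions.items():
                expected = sum(p * values[nxt] for p, nxt in outcomes)
                q_value = expected - cost_weight * cost
                if q_value > best_value:
                    best_action, best_value = action, q_value
            values[state] = best_value
            policy[state] = best_action
    return policy, values


# Cheap computation favors the careful path; costly computation the fast one.
accurate_policy, _ = solve(TRANSITIONS, TERMINAL_ACCURACY, 10.0, 1.0)
frugal_policy, _ = solve(TRANSITIONS, TERMINAL_ACCURACY, 10.0, 2.0)
```

Changing only the weights changes which algorithm sequence is selected, which is the sense in which the reward encodes the trade-off between cost and accuracy.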

3  Relevance  to  APGD 

To  the  extent  that  we  are  successful,  our  tech¬ 
nology  for  learning  object  recognition  strategies 
could  be  used  within  the  APGD  context  for 
context-based  algorithm  control.  We  imagine 
a  system  in  which  image  analysts  begin  to  pop¬ 
ulate  a  database  by  outlining  and  labeling  ob¬ 
jects  of  interest  in  images.  As  the  analysts  work, 
the  system  uses  these  labeled  object  instances 
as  training  samples  for  learning  object  recogni¬ 
tion  policies.  When  the  strategy  for  an  object 
class  is  trained,  the  system  will  take  over  for  the 
analyst  and  automatically  label  any  remaining 
instances  of  the  object  class  and  enter  these  in¬ 
stances  into  the  geospatial  database.  Moreover, 
as new images are acquired the system will automatically
label object instances and update
the  database. 

Through  continued  training,  a  system  which 
learns  strategies  could  in  principle  adapt  to  rec¬ 
ognize  any  visually  distinct  object  class.  More¬ 
over,  because  the  system  will  be  based  on  a 
Markov  decision  process  model  and  reinforce¬ 
ment  learning,  there  are  analytical  reasons  to 
believe  that  the  control  policies  it  learns  will  be 
sound; this is not true of current hand-crafted
systems.  Finally,  because  the  reinforcement 
learning  process  makes  a  series  of  predictions 
about  the  intermediate  data  it  generates,  the 
system  should  be  able  to  detect  when  its  con¬ 
trol  policies  are  failing  (perhaps  because  of  a 
new  variant  of  an  object  class  or  because  of  a 
change  in  the  domain)  and  to  ask  the  image  ana¬ 
lyst  for  new  training  samples,  rather  than  make 
possibly  disastrous  mistakes. 

4  Image  Understanding  as  a  Control 
Problem 

Underlying  this  work  is  the  notion  of  IU  as  a
control  problem.  In  casting  IU  as  a  control
problem,  we  make  the  assumption  that  there  is
a  library  of  available  IU  algorithms  (such  as  the
ones  mentioned  in  Section  2),  and  that  these  al¬
gorithms  are  sufficient  to  solve  many  interesting
IU  problems.  We  also  assume  that  the  system
is  given  an  IU  task  of  the  form  “Find  the  X
in  Y”,  where  X  is  an  object  to  be  recognized 
and  Y  is  a  set  of  images;  examples  of  X  in  the 
APGD  context  include  buildings,  roads,  trees 
and  power  lines,  while  an  example  of  Y  might 
be  EO  nadir-view  images  of  a  constrained  ge¬ 
ographic  area  such  as  Bosnia  or  Somalia.  The 
challenge  for  the  control  system  is  to  find  a  se¬ 
quence  of  algorithms  (ideally,  the  best  sequence 
of  algorithms)  for  finding  the  target  object  in 
the  given  domain. 

Readers  should  note  that  we  propose  to  control 
algorithms,  not  physical  devices  such  as  cameras 
or  platforms.  Other  researchers  have  addressed 
physical  camera  control  in  the  context  of  active 
vision  research.  Also,  there  is  a  harder  version 
of  the  IU  problem  in  which  the  object  being
searched  for  is  unspecified  (i.e.  “What’s  in  the 
image,  and  where  is  it?”).  This  form  of  the 
problem  includes  the  object  indexing  problem, 
which  is  outside  of  the  scope  of  this  work. 

4.1  Open-  vs.  Closed-loop  Control

Once  the  decision  is  made  to  cast  IU  as  a  control
problem,  several  interesting  questions  emerge. 
The  first  of  these  questions  is  whether  the  sys¬
tem  is  an  open-loop  or  closed-loop  control  sys¬
tem.  In  open-loop  control,  the  system  selects  a
fixed  sequence  of  actions  for  each  task.  Open- 
loop  control  systems  have  the  advantage  that 
recognition  strategies  can  be  easily  expressed  as 
sequences  of  algorithms,  and  open-loop  policies 
can  be  learned  by  searching  the  space  of  algo¬
rithm  sequences;  Brown  &  Roberts  [1994],  for
example,  use  genetic  algorithms  to  search  for 
the  best  sequence  of  algorithms  for  a  specific 
automatic  target  recognition  (ATR)  task. 

On  the  other  hand,  open-loop  control  systems 
have  the  disadvantage  that  they  are  unable  to 
adjust  to  unexpected  events.  For  example,  one 
strategy  for  finding  buildings  in  aerial  images 
might  be  to  first  extract  building  corners,  where 
the  camera  viewpoint  determines  the  expected 
image  angle.  Unfortunately,  if  the  corner  de¬ 
tector  fails  in  an  open-loop  system  (perhaps  be¬ 
cause  of  an  error  in  the  estimated  viewpoint) 
the  open-loop  strategy  will  continue  to  apply 
the  remaining  algorithms  in  the  sequence,  even 
though  the  data  produced  by  the  first  step  was 
erroneous.  Closed-loop  systems,  on  the  other 
hand,  do  not  produce  explicit  sequences  of  ac¬ 
tions.  Instead,  they  select  an  algorithm  at  each 
stage  of  the  process  based  on  the  results  of 
the  previous  processing  step.  This  gives  closed- 
loop  systems  the  ability  to  react  to  unexpected 
events  during  processing,  for  example  by  back¬ 
tracking  and  selecting  another  algorithm. 

More  formally,  closed-loop  control  strategies  are 
defined  by  policies  that  map  states  of  the  sys¬ 
tem  onto  actions  (i.e.  algorithms),  where  sys¬ 
tem  states  are  determined  by  measuring  feature 
attributes.  Closed-loop  control  is  more  powerful 
than  open-loop  control,  and  it  is  our  belief  that 
variations  among  images  within  a  domain  and 
the  inherent  unreliability  of  many  IU  algorithms
imply  that  closed-loop  policies  will  be  needed 
for  robust  control.  This  is  still  a  hypothesis, 
however,  and  one  of  our  tasks  in  this  project 
will  be  to  compare  closed-loop  strategies  learned 
through  reinforcement  learning  with  open-loop 
strategies  learned  through  search. 
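
The contrast between the two control styles can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the three toy "algorithms", the dictionary-based state, the numbers, and the hand-written policy rules are stand-ins chosen for illustration only. In the proposed system the policy would be learned rather than written by hand.

```python
# Toy illustration of open-loop vs. closed-loop control.
# All algorithms, state fields and numbers below are hypothetical.

def extract_corners(state):
    # Assumed failure mode: no corners are found if the viewpoint estimate is wrong.
    found = 0 if state["viewpoint_error"] else 12
    return {**state, "corners": found, "log": state["log"] + ["extract_corners"]}

def extract_lines(state):
    # Assumed to be insensitive to the viewpoint estimate.
    return {**state, "lines": 30, "log": state["log"] + ["extract_lines"]}

def group_rectangles(state):
    # Grouping succeeds only if an earlier step produced usable features.
    usable = state.get("corners", 0) > 0 or state.get("lines", 0) > 0
    return {**state, "buildings": 3 if usable else 0,
            "log": state["log"] + ["group_rectangles"]}

def run_open_loop(sequence, state):
    # A fixed sequence is applied regardless of intermediate results.
    for algorithm in sequence:
        state = algorithm(state)
    return state

def closed_loop_policy(state):
    # Maps the measured state to the next algorithm; None means stop.
    if "buildings" in state:
        return None
    if "corners" not in state:
        return extract_corners
    if state["corners"] == 0 and "lines" not in state:
        return extract_lines  # react to the failed corner detector
    return group_rectangles

def run_closed_loop(policy, state, max_steps=10):
    for _ in range(max_steps):
        algorithm = policy(state)
        if algorithm is None:
            break
        state = algorithm(state)
    return state

start = {"viewpoint_error": True, "log": []}
open_result = run_open_loop([extract_corners, group_rectangles], start)
closed_result = run_closed_loop(closed_loop_policy, start)
print(open_result["buildings"], open_result["log"])
print(closed_result["buildings"], closed_result["log"])
```

In this toy setting the open-loop sequence carries on after the corner detector fails and finds no buildings, whereas the closed-loop policy notices the empty corner set and falls back to line extraction.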

4.2  Expert  Knowledge  vs.  Machine 
Learning 

Another  question  is  whether  control  decisions 
come  from  reasoning  about  expert  knowledge 
or  whether  they  are  learned  automatically  from
experience.  We  believe  that  although  many
variations  on  expert  systems  have  been  tried
for  IU  (e.g.  SPAM  [McKeown  et  al,  1985],
SIGMA  [Hwang  et  al,  1986],  PSEIKI  [Andress 
&  Kak,  1988],  the  Schema  System  [Draper  et 
al,  1989]  and  more  recently  OCAPI  [Clement 
&  Thonnat,  1993]),  they  have  always  proven  to 
be  difficult  to  build  and  harder  to  extend.  More¬ 
over,  even  when  they  produce  acceptable  results 
on  a  limited  set  of  images,  it  is  difficult  to  tell 
whether  the  control  system  has  performed  well 
or  not.  We  base  our  conclusions  in  part  on  our 
own  past  work.  In  [Draper  &  Hanson,  1991] 
we  used  the  Schema  System  to  illustrate  prob¬ 
lems  inherent  in  the  hand-construction  of  expert 
knowledge  bases  for  IU,  while  in  [Draper  et  al,
1996]  we  discuss  these  issues  more  broadly.

The  alternative  is  to  build  a  system  that  learns 
control  strategies  based  on  examples.  Brown 
&  Roberts  [1994]  is  one  example  of  a  system 
that  learns  open-loop  control  policies  based  on
examples  ([Chen  &  Mulgaonkar,  1992]  is  an¬ 
other).  We  propose  to  build  on  our  earlier  work 
[Draper,  1996a;  Draper,  1996b]  by  building  a 
system  that  uses  reinforcement  learning  to  au¬ 
tomatically  acquire  closed-loop  control  policies 
from  examples.  (In  related  but  different  work, 
[Peng  &  Bhanu,  1996]  used  reinforcement  learn¬
ing  to  select  parameters  for  IU  algorithms.)
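
As a rough illustration of how a closed-loop policy might be acquired by reinforcement learning, the following Python sketch applies tabular Q-learning to a tiny, fully hypothetical decision process. The state names, algorithm names, rewards and learning parameters are all assumptions made for this example; the sketch shows the general technique, not the specific learning method of [Draper, 1996a; Draper, 1996b].

```python
import random

# Hypothetical toy decision process: state -> {algorithm: (next_state, reward)}.
# A next_state of None ends the episode. Negative rewards stand for run-time
# cost; the final reward stands for recognition accuracy.
TRANSITIONS = {
    "start": {
        "extract_corners": ("corners_bad", -1.0),
        "extract_lines": ("lines_ok", -2.0),
    },
    "corners_bad": {
        "group": (None, 0.0),                 # grouping bad corners finds nothing
        "extract_lines": ("lines_ok", -2.0),  # fall back to line extraction
    },
    "lines_ok": {
        "group": (None, 10.0),
    },
}

def q_learning(transitions, episodes=2000, alpha=0.5, gamma=1.0,
               epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    for _ in range(episodes):
        state = "start"
        while state is not None:
            actions = list(q[state])
            if rng.random() < epsilon:
                action = rng.choice(actions)              # explore
            else:
                action = max(actions, key=q[state].get)   # exploit
            next_state, reward = transitions[state][action]
            future = 0.0 if next_state is None else max(q[next_state].values())
            # Standard Q-learning update toward the one-step lookahead target.
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q

def greedy_policy(q):
    # The learned closed-loop policy: best-valued algorithm in each state.
    return {s: max(acts, key=acts.get) for s, acts in q.items()}

q_table = q_learning(TRANSITIONS)
policy = greedy_policy(q_table)
print(policy)
```

The resulting table maps each state to an algorithm, which is the form of policy described in Section 4.1. In a real system the states would be derived from measured feature attributes rather than enumerated by hand.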

It  is  our  contention  that  in  the  long  run  machine 
learning  is  the  best  source  of  control  strategies. 
A  fielded  APGD  system,  for  example,  must  be 
able  to  adapt  to  new  object  classes  in  new  do¬ 
mains.  A  system  requiring  expert  modification 
and  recertification  for  each  new  object  or  do¬ 
main  is  not  acceptable  to  the  military  (or  in¬
deed  to  any  user).  Instead,  it  is  our  goal  to  show
that  robust  closed-loop  control  policies  for  ob¬ 
ject  recognition  can  be  learned  by  observing  an 
expert. 


4.3  General-purpose  vs.  Object- 
specific  Attributes 

A  critical  issue  for  closed-loop  control  systems 
is  generating  object-specific  attribute  mea¬
sures  for  intermediate  data  representations.  In
a  closed-loop  system,  a  control  policy  selects
actions  (i.e.  algorithms)  based  on  the  current
state  of  the  system.  The  system  state  in  turn  is 
a  reflection  of  attributes  that  can  be  measured 
for  the  features  that  have  been  extracted  up  to 
that  point.  For  example,  in  the  hypothetical 
closed-loop  control  policy  for  recognizing  build¬ 
ings  mentioned  in  Section  4.1,  the  first  action 
was  to  extract  corners  from  the  image  data.  The 
second  action  was  then  selected  based  on  the 
number  and  quality  of  corners  found  in  the  first 
step. 

Clearly,  closed-loop  control  policies  can  only 
outperform  open-loop  action  sequences  if  the  at¬ 
tributes  of  the  image  features  provide  meaning¬ 
ful  feedback.  In  earlier  experiments  on  learn¬ 
ing  to  recognize  buildings  we  provided  a  system 
with  routines  for  computing  sophisticated  fea¬ 
ture  attributes,  including  an  algorithm  which 
measured  how  much  of  a  shadow  a  feature  cast 
(based  on  the  known  camera  viewpoint).  This 
attribute  proved  to  be  critical;  as  reported  in 
[Draper,  1996a]  (page  1453),  the  number  of  false 
alarms  detected  by  the  trained  system  dropped 
significantly  when  the  shadow  attribute  was  in¬ 
troduced. 

In  this  project,  we  intend  to  have  the  system  de¬ 
velop  meaningful  attributes  on  its  own.  Some 
of  these  attributes  will  be  learned,  while  oth¬ 
ers  will  be  deduced  from  a  priori  models.  At¬ 
tributes  may  be  derived  from  many  levels  of  rep¬ 
resentation,  including  image  properties,  such  as 
color  and  texture,  and  object  geometry.  For 
instance,  we  will  extend  our  earlier  work  with 
linear  machine  decision  trees  to  learn  the  ap¬ 
parent  color  of  objects  in  outdoor  imagery  [Bu- 
luswar  &  Draper,  1994]  to  train  attributes  that 
match  image  features  to  the  expected  textures 
(and  if  available,  colors)  of  objects.  We  will  also 
build  on  current  work  for  matching  features  de¬ 
rived  from  geometric  object  models.  Prior  ex¬ 
amples  include  our  past  work  on  matching  hori¬ 
zons  derived  from  terrain  maps  to  imagery  [Bev¬ 
eridge  &  Balasubramanian,  1997]  and  our  multi¬ 
sensor  system  for  predicting  observable  features 
of  CAD  vehicle  models  [Stevens  &  Beveridge, 
1997].  The  specific  feature  sets  developed  in 
these  examples  do  not  carry  over  directly  to 
the  APGD  domain.  However,  the  principles 
employed  by  the  algorithms  generating  these 
features  are  applicable  to  APGD.  We  will  also 
be  expanding  our  work  on  a  set  of  algorithms
for  learning  probe  sets  and/or  eigenvalue  mea¬
sures  from  example  features  extracted  from  im¬
ages  [Stevens  et  al,  1997].

5  Evaluating  Recognition  Strategies 

In  order  to  evaluate  our  ability  to  learn  closed- 
loop  object  recognition  policies,  we  will  apply 
our  system  to  the  APGD  Fort  Hood  dataset^ 
and  test  its  ability  to  recognize  objects  of  strate¬ 
gic  interest.  In  particular,  we  will  begin  by 
training  recognition  policies  to  find  buildings 
and  roads.  Then  we  will  test  how  easily  the 
system  can  adapt  to  new  tasks  by  training  it  to 
recognize  two  more  object  classes,  to  be  deter¬ 
mined  jointly  by  ourselves  and  representatives 
of  the  Army  Topographic  Engineering  Center 
(TEC).  Finally,  we  will  adopt  a  new  dataset 
from  a  different  domain  to  see  how  easily  the 
system  adapts  from  one  setting  to  another. 

In  general,  when  evaluating  our  system,  the 
quality  of  a  control  policy  will  be  measured  by 
a  utility  function  that  balances  accuracy  and 
cost.  (The  relative  weight  of  accuracy  vs.  cost 
is  determined  by  the  user  prior  to  training.)  To 
measure  the  effectiveness  of  the  learning  system, 
however,  we  must  separate  the  performance  of
the  control  policy  from  the  quality  of  the  un¬
derlying  IU  algorithms.  To  do  this,  we  will
compare  control  policies  against  two  standards. 
The  first  standard  is  the  result  of  an  exhaus¬ 
tive  search  of  the  space  of  open-loop  strate¬ 
gies.  (We  can  compute  this  because  the  space 
of  open-loop  strategies  is  much  smaller  than 
the  space  of  closed-loop  policies.)  Closed-loop 
policies  trained  through  reinforcement  learning 
will  then  be  compared  to  the  optimal  open-loop 
strategy  according  to  the  user-defined  utility 
function.  Second,  we  will  compare  closed-loop 
policies  to  each  other.  Although  there  is  no  way 
to  know  what  the  true  optimal  closed-loop  pol¬ 
icy  for  a  given  task  may  be,  if  we  train  multi¬ 
ple  closed-loop  policies  we  can  compare  them  to 
each  other,  determining  which  are  the  best  (and 
by  how  much). 
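
The first standard can be sketched in Python as follows. The linear form of the utility function, the per-algorithm cost and accuracy figures, and the additive toy evaluator are all assumptions made for illustration; the text above specifies only that utility balances accuracy against cost with a user-chosen weight.

```python
from itertools import permutations

# Hypothetical (cost, accuracy gain) figures per algorithm, for illustration only.
ALGORITHMS = {
    "corners": (1.0, 0.3),
    "lines": (2.0, 0.5),
    "texture": (4.0, 0.4),
}

def utility(accuracy, cost, weight):
    # weight in [0, 1] is the user-chosen emphasis on accuracy over cost.
    return weight * accuracy - (1.0 - weight) * cost

def evaluate(sequence):
    # Toy evaluator: costs add up; accuracy gains add up and saturate at 1.
    cost = sum(ALGORITHMS[name][0] for name in sequence)
    accuracy = min(1.0, sum(ALGORITHMS[name][1] for name in sequence))
    return accuracy, cost

def score(sequence, weight):
    accuracy, cost = evaluate(sequence)
    return utility(accuracy, cost, weight)

def best_open_loop(weight, max_length=3):
    # Exhaustively enumerate every ordered sequence of distinct algorithms.
    candidates = [
        seq
        for length in range(1, max_length + 1)
        for seq in permutations(ALGORITHMS, length)
    ]
    return max(candidates, key=lambda seq: score(seq, weight))

best = best_open_loop(weight=0.9)
print(best, round(score(best, 0.9), 2))
```

Changing the weight changes which open-loop strategy is optimal, which is why the baseline has to be recomputed for each user-defined utility function.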

^For  readers  unfamiliar  with  the  Fort  Hood  data,  it 
is  a  collection  of  approximately  twenty  high-resolution 
black-and-white  aerial  images  of  Fort  Hood,  TX,  includ¬ 
ing  approximate  camera  parameters  for  each  image. 


6  Practicum:  Khoros  and  the  IUE

At  a  more  mundane  level,  work  on  IU  as  a  con¬
trol  problem  can  only  proceed  if  libraries  of  IU
algorithms  are  accessible.  One  of  the  goals  of
the  Image  Understanding  Environment  (IUE)
[Mundy  et  al,  1992]  is  to  disseminate  libraries
of  IU  “task”  objects,  which  are  implementations
of  IU  algorithms.  To  this  end,  we  have  been  ac¬
tively  contributing  to  the  IUE  effort.  Our  most
recent  contribution  is  a  target  detection  algo¬ 
rithm  for  use  in  IR  imagery.  This  algorithm  was 
selected  as  an  archetypical  ATR  algorithm  and 
it  is  based  loosely  on  the  concepts  of  a  sliding 
window  detector  set  out  by  Nguyen  [Nguyen, 
1990];  it  is  of  practical  interest  because  it  is 
used  as  the  first  phase  of  a  two  phase  target 
detection  algorithm  on  the  Unmanned  Ground 
Vehicle  Program’s  Semi-Autonomous  Scout  Ve¬ 
hicles.  We  are  also  developing  a  version  of  our 
optimal  line  segment  matching  system  for  re¬
lease  with  the  IUE.

At  the  same  time,  we  are  forced  to  recognize  the 
limited  state  of  the  current  IUE  task  library.  We
will  therefore  be  developing  our  learning  sys¬ 
tem  within  the  Khoros  image  processing  envi¬ 
ronment  [Rasure  &  Kubica,  1994]  which  at  the 
moment  has  a  more  extensive  library  of  (mostly 
low-level)  computer  vision  algorithms,  and  we 
will  be  extending  this  library  as  necessary.  For¬
tunately,  long-term  plans  call  for  the  IUE  to  be¬
come  compatible  with  Khoros,  and  it  is  our  hope
to  be  able  to  access  both  the  IUE  and  Khoros
task  libraries  in  the  relatively  near  future. 

7  Recent  Accomplishments 

Although  the  main  thrust  of  this  paper  is  to  look 
forward  to  our  project  on  learning  object  recog¬ 
nition  policies,  we  thought  it  would  be  useful 
to  briefly  summarize  some  of  our  recent  accom¬ 
plishments,  and  in  particular  to  expose  some 
underlying  intellectual  connections  between  our 
previous  work  and  our  new  project. 

7.1  Automatic  Target  Recognition 

Over  the  past  three  years  we  have  devoted
considerable  attention  to  the  development  of
new  model-based  ATR  techniques  for  multisen¬
sor  imagery.  This  work  was  supported  by  the
DARPA  IU  Program  as  part  of  the  Unmanned
Ground  Vehicle  (UGV)  program’s  RSTA  (Re¬
connaissance,  Surveillance  and  Target  Acquisi¬
tion)  activity.  A  detailed  report  on  this  effort
appears  in  [Beveridge  et  al,  1997a];  here  we
discuss  only  those  portions  of  the  work  that  are
relevant  to  the  APGD  project.

Table  1:  Confusion  matrix  for  Multisensor  Tar¬
get  Identification.  Correct  identification  rate  is
27/35  (77%).  The  two  entries  marked  with
are  cases  where  hypothesis  generation  failed  to
suggest  the  correct  target  type.

The  CSU  ATR  system  was  a  three-stage  tar¬ 
get  detection  and  recognition  system  that  per¬ 
formed  well  on  the  difficult  Ft.  Carson  data  set 
[Beveridge  et  al,  1994].  On  35  target  identifica¬ 
tion  tasks  involving  4  targets,  the  CSU  system 
correctly  identified  27  out  of  35  (77%)  of  the 
targets.  If  we  neglect  difficult  cases,  such  as 
distant  and  occluded  targets,  the  correct  identi¬ 
fication  rate  is  over  90%.  The  confusion  matrix 
summarizing  this  result  is  presented  in  Table  1. 

As  a  point  of  comparison,  the  group  from  MIT 
Lincoln  Laboratory  has  used  the  Ft.  Car- 
son  dataset  in  part  of  the  evaluation  of  their 
own  range-based  ATR  system  [Verly  &  Lacoss,
1997].  Based  upon  their  performance  model¬ 
ing  work,  they  conclude  their  approach  is  only 
applicable  to  four  of  the  highest  resolution  Ft. 
Carson  range  images. 

Although  some  aspects  of  the  CSU  ATR  system 
are  tailored  to  ATR,  some  of  the  technology  is 
applicable  to  the  APGD  task.  The  first  stage  of 
the  ATR  system  used  a  linear  machine  decision 


tree  to  learn  the  range  of  apparent  colors  exhib¬ 
ited  by  an  object  under  outdoor  lighting  condi¬ 
tions  [Buluswar  &  Draper,  1994].  Although  the 
Ft.  Hood  dataset  used  for  the  current  APGD 
work  does  not  include  color  data,  this  same 
learning  technique  can  be  used  to  learn  com¬ 
binations  of  texture  measures  extracted  from 
black  and  white  imagery.  How  well  an  image 
feature  matches  the  expected  texture  of  an  ob¬ 
ject  then  becomes  an  attribute  that  a  closed- 
loop  control  system  can  use  for  feedback. 

The  second  stage  of  the  CSU  ATR  system  used 
probing  techniques  to  suggest  possible  targets 
and  target  orientations.  The  probing  techniques 
derived  probe  sets  from  BRL/CAD  models  of 
possible  targets,  and  we  developed  new  tech¬ 
niques  based  on  neural  networks  for  efficiently 
selecting  the  most  relevant  probesets  [Stevens 
et  al,  1997].  Once  again,  although  the  objects 
being  searched  for  in  the  APGD  domain  will  be 
different,  probing  techniques  such  as  these  can 
be  used  to  develop  object-specific  feature  at¬ 
tributes  whenever  either  object  models  or  sub¬ 
stantial  training  imagery  are  available.  We  have 
also  begun  to  explore  the  intellectual  connec¬
tions  between  probing  and  eigenspace  analy¬
sis  [Nayar  et  al,  1996;  Kirby  &  Sirovich,  1990;
Turk  &  Pentland,  1991],  and  are  looking  for
ways  to  train  both  probe  sets  and  eigenspace 
representations  from  the  APGD  training  sam¬ 
ples. 

Finally,  the  third  stage  of  the  CSU  ATR  system 
performed  the  final  target  identification  and  tar¬ 
get  pose  determination  by  matching  3D  tar¬ 
get  models  to  the  multisensor  image  data.  By 
exploiting  an  iterative  predict-and-match  cycle 
between  the  3D  object  model  and  the  multi¬ 
sensor  image  data,  we  have  demonstrated  what 
we  consider  to  be  several  significant  advances  in 
the  state-of-the-art  for  ground-based  multisen¬ 
sor  ATR.  We  have  demonstrated  an  ability  to 
take  a  rough  target  pose  estimate,  i.e.  off  by 
as  much  as  30°,  and  generate  a  more  reliable 
estimate  accurate  to  within  about  5°  [Stevens 
&  Beveridge,  1997].  Further,  this  is  done  in
the  presence  of  errors  in  the  initial  registration 
mappings  between  sensors:  our  algorithm  re¬ 
fines  sensor  registration  and  3D  target  pose  as 
part  of  the  matching  process. 

Although  this  work  might  at  first  seem  unre¬
lated  to  the  current  APGD  effort,  it  implies
that  our  learning  systems  will  have  access
to  state-of-the-art  algorithms  for  matching  ge¬ 
ometric  models  to  data.  Moreover,  it  demon¬ 
strates  an  ability  to  propagate  evidence  for  ter¬ 
rain  occlusion  between  sensors  and  accordingly 
modify  range,  IR  and  color  target  features  dur¬ 
ing  the  matching  cycle.  This  means  that  the 
CSU  ATR  system  does  not  try  to  find  features 
which  it  can  infer  are  occluded  by  foreground 
terrain.  Such  an  ability  to  reason  about  why 
certain  features  might  not  be  seen  will  be  criti¬ 
cal  to  control  systems  that  must  distinguish  be¬ 
tween  features  that  are  missing  (or  may  have 
changed),  and  those  which  simply  cannot  be 
seen  from  the  current  viewpoint. 

8  Conclusion 

The  current  focus  of  IU  research  at  Colorado
State  University  is  on  understanding  the  impli¬
cations  of  modeling  IU  as  a  control  problem,
and  on  building  practical  systems  that  learn  ob¬ 
ject  recognition  strategies  from  examples.  This 
work  is  being  conducted  in  the  context  of  the 
Automatic  Population  of  Geospatial  Databases 
(APGD)  project,  where  it  will  be  used  to  learn 
object  recognition  strategies  for  finding  build¬ 
ings,  roads  and  other  objects  of  interest  in  aerial 
images. 

Abstract

In  this  paper,  we  examine  the  use  of  ma¬ 
chine  learning  to  improve  the  robustness 
of  systems  for  image  analysis  on  the  task 
of  roof  detection.  We  review  the  problem 
of  analyzing  aerial  photographs,  and  de¬ 
scribe  an  existing  vision  system  that  at¬ 
tempts  to  automate  the  identification  of 
buildings  in  aerial  images.  After  this,  we 
briefly  review  several  well-known  learning 
algorithms  that  represent  a  wide  variety  of 
inductive  biases.  We  report  three  experi¬ 
ments  designed  to  illuminate  facets  of  ap¬ 
plying  machine  learning  methods  to  the  im¬ 
age  analysis  task;  one  experiment  focuses 
on  within-image  learning,  another  deals 
with  the  cost  of  different  errors,  and  a  third 
addresses  between-image  learning.  Exper¬ 
imental  results  demonstrate  that  machine- 
learned  classifiers  meet  or  exceed  the  ac¬ 
curacy  of  handcrafted  solutions  and  that 
useful  generalization  occurs  when  training 
and  testing  on  data  derived  from  different 
images. 

1  Introduction 

The  number  of  images  available  to  image  analysts 
is  growing  rapidly,  and  will  soon  outpace  their  abil¬ 
ity  to  process  them.  Computational  aids  will  be  re¬ 
quired  to  filter  this  flood  of  images  and  focus  the 
analyst’s  attention  on  interesting  events,  but  cur¬ 
rent  image  understanding  systems  are  not  yet  robust 
enough  to  support  this  process.  Successful  image
understanding  relies  on  knowledge,  and  despite  the¬
oretical  progress,  implemented  vision  systems  still
rely  on  heuristic  methods  that  remain  fragile.  Hand¬
crafted  knowledge  about  when  and  how  to  use  par¬
ticular  vision  operations  can  give  acceptable  results
on  some  images  but  not  others.

‘This  research  was  supported  by  the  Defense  Ad¬
vanced  Research  Projects  Agency  under  grant  N00014-
94-1-0746,  administered  by  the  Office  of  Naval  Research.

In  this  paper  we  explore  the  use  of  machine  learn¬ 
ing  as  a  means  for  improving  knowledge  used  in  the 
vision  process,  and  thus  for  producing  more  robust 
software.  Recent  applications  of  machine  learning 
in  business  and  industry  [Langley  and  Simon,  1995] 
hold  useful  lessons  for  its  application  in  image  analy¬
sis.  A  key  idea  in  applied  machine  learning  involves
building  an  advisory  system  that  recommends  ac¬ 
tions  but  gives  final  control  to  a  human  user,  with 
each  decision  generating  a  training  case,  gathered  in 
an  unobtrusive  way,  for  use  in  learning.  This  setting 
for  knowledge  acquisition  is  similar  to  the  scenario  in 
which  an  image  analyst  interacts  with  a  vision  sys¬ 
tem,  finding  some  system  analyses  acceptable  and 
others  uninteresting  or  in  error.  The  aim  of  our  re¬ 
search  program  is  to  embed  machine  learning  into 
this  interactive  process  of  image  analysis. 

This  adaptive  approach  to  computer  vision  promises 
to  greatly  reduce  the  number  of  decisions  that  im¬ 
age  analysts  must  make  per  picture,  thus  improv¬ 
ing  their  ability  to  deal  with  a  high  flow  of  images. 
Moreover,  the  resulting  systems  should  adapt  their 
knowledge  to  the  preferences  of  individual  users  in 
response  to  feedback  from  those  users.  The  over¬ 
all  effect  should  be  a  new  class  of  systems  for  image 
analysis  that  reduce  the  workload  on  human  analysts 
and  give  them  more  reliable  results,  thus  speeding 
the  image  analysis  process. 

In  the  sections  that  follow,  we  report  initial  progress 
on  using  machine  learning  to  improve  decision  mak¬ 
ing  at  one  stage  in  an  existing  image  understanding 
system.  We  begin  by  explaining  the  task  domain  —
identifying  buildings  in  aerial  photographs  —  and 
then  describe  the  vision  system  designed  for  this 
task.  Next  we  review  five  well-known  algorithms  for 
supervised  learning  that  hold  potential  for  improv¬ 
ing  the  reliability  of  image  analysis  in  this  domain. 
After  this,  we  report  the  design  of  experiments  to 
evaluate  these  methods  and  the  results  of  those  stud¬ 
ies.  In  closing,  we  consider  related  work  on  learning 
for  image  understanding  and  some  directions  for  fu¬ 
ture  research. 

2  Nature  of  the  Image  Analysis  Task 

The  image  analyst  interprets  aerial  images  of  ground 
sites  with  an  eye  to  unusual  activity  or  other  inter¬ 
esting  behavior.  The  images  under  scrutiny  are  usu¬ 
ally  complex,  involving  many  objects  arranged  in  a 
variety  of  patterns.  A  typical  image  from  the  Fort 
Hood  RADIUS  repository,  which  contains  satellite 
photographs  of  a  military  base,  includes  buildings  in 
a  range  of  sizes  and  shapes,  major  and  minor  road¬ 
ways,  sidewalks,  parking  lots,  vehicles,  and  vegeta¬ 
tion.  A  common  task  faced  by  the  image  analyst  is 
to  detect  change  at  a  site  as  reflected  in  differences 
between  two  images,  as  in  the  number  of  buildings, 
roads,  and  vehicles.  This  in  turn  requires  the  abil¬ 
ity  to  recognize  examples  from  each  class  of  interest. 
In  this  paper,  we  focus  on  the  performance  task  of 
identifying  buildings  in  satellite  photographs. 

Aerial  images  can  vary  across  a  number  of  dimen¬ 
sions.  The  most  obvious  factors  concern  viewing  pa¬ 
rameters,  such  as  distance  from  the  site  (which  af¬ 
fects  size  and  resolution)  and  viewing  angle  (which 
affects  perspective  and  visible  surfaces).  But  other 
variables  also  influence  the  nature  of  the  image,  in¬ 
cluding  the  time  of  day  (which  affects  contrast  and 
shadows),  the  time  of  year  (which  affects  foliage), 
and  the  site  itself  (which  determines  the  shapes  of 
viewed  objects).  Taken  together,  these  factors  in¬ 
troduce  considerable  variability  into  the  images  that 
confront  the  analyst. 

In  turn,  this  variability  can  significantly  compli¬ 
cate  the  task  of  recognizing  object  classes.  Al¬ 
though  a  building  or  vehicle  will  appear  different 
from  alternative  perspectives  and  distances,  the  ef¬ 
fects  of  such  transformations  are  reasonably  well  un¬ 
derstood.  But  variations  due  to  time  of  day,  the  sea¬ 
son,  and  the  site  are  more  serious.  Shadows  and  fo¬ 
liage  can  hide  edges  and  obscure  surfaces,  and  build¬ 
ings  at  distinct  sites  may  have  quite  different  struc¬ 
tures  and  layouts.  Such  variations  serve  as  mere 
distractions  to  the  human  image  analyst,  yet  they 
provide  serious  challenges  to  existing  computer  vi¬ 
sion  systems. 


This  suggests  a  natural  task  for  machine  learning: 
given  aerial  images  as  training  data,  acquire  knowl¬ 
edge  that  improves  the  reliability  of  such  an  image 
analysis  system.  However,  we  cannot  study  this  task 
in  the  abstract.  We  must  explore  the  effect  of  specific 
induction  algorithms  on  particular  vision  software. 
In  the  next  two  sections,  we  briefly  review  one  such 
system  for  image  analysis,  followed  by  five  learning 
methods  that  might  give  it  more  robust  behavior. 

3  An  Architecture  for  Image  Analysis 

Lin  and  Nevatia  [1996]  report  a  computer  vision  sys¬ 
tem  for  the  analysis  of  ground  sites  in  aerial  im¬ 
ages.  Like  many  programs  for  image  understanding, 
their  system  operates  in  a  series  of  processing  stages. 
Each  step  involves  aggregating  lower  level  features 
into  higher  level  ones,  eventually  reaching  hypothe¬ 
ses  about  the  locations  of  buildings.  We  will  consider 
these  stages  in  the  order  they  occur. 

Starting  at  the  pixel  level,  the  system  uses  an  edge 
detector  to  group  pixels  into  edgels,  and  then  in¬ 
vokes  a  line  finder  to  group  edgels  into  lines.  Junc¬ 
tions  and  parallel  lines  are  identified  and  combined 
to  form  three-sided  structures  or  “Us”.  The  algo¬ 
rithm  then  groups  selected  Us  and  junctions  to  form 
parallelograms.  Each  such  parallelogram  constitutes 
a  hypothesis  about  the  position  and  orientation  of 
the  roof  for  some  building,  so  we  may  call  this  step 
‘rooftop  generation’. 

After  the  system  has  completed  the  above  aggre¬ 
gation  process,  a  ‘rooftop  selection’  stage  evaluates 
each  hypothesis  to  determine  whether  that  candi¬ 
date  has  sufficient  evidence  to  be  retained.  The  aim 
of  this  process  is  to  remove  hypotheses  that  do  not 
correspond  to  actual  buildings.  Ideally,  the  system 
will  reject  most  spurious  hypotheses  at  this  point, 
although  a  final  ‘verification’  step  may  still  collapse 
duplicate  or  overlapping  rooftops.  This  stage  may 
also  exclude  hypotheses  if  there  exists  no  evidence 
of  three-dimensional  structure,  such  as  shadows  and 
walls. 

Analysis  of  the  system’s  operation  suggested  that 
rooftop  selection  held  the  most  promise  for  improve¬ 
ment  through  machine  learning,  because  this  stage 
must  deal  with  many  spurious  hypotheses.  This  pro¬ 
cess  takes  into  account  both  local  and  global  crite¬ 
ria.  Local  support  comes  from  features  such  as  lines 
and  corners  that  are  close  to  a  given  parallelogram. 
Since  these  suggest  walls  and  shadows,  they  provide 
evidence  that  the  hypothesis  corresponds  to  an  ac¬ 
tual  building.  Global  criteria  consider  containment, 
overlap,  and  duplication  of  hypotheses.  Using  these 
evaluation  criteria,  the  set  of  hypotheses  is  reduced 
to  a  more  manageable  size. 

The  individual  constraints  applied  in  this  process
have  a  solid  foundation  in  both  theory  and  practice. 
The  problem  is  that  we  have  only  heuristic  knowl¬ 
edge  about  how  to  combine  them.  Moreover,  such 
rules  of  thumb  are  currently  crafted  by  hand,  and 
they  do  not  fare  well  on  images  that  vary  in  their 
global  characteristics,  such  as  contrast  and  amount 
of  shadow.  However,  methods  from  machine  learn¬ 
ing,  to  which  we  now  turn,  may  be  able  to  induce 
better  conditions  for  selecting  or  rejecting  candidate 
roofs.  If  these  acquired  heuristics  are  more  accu¬ 
rate  than  the  existing  handcrafted  solutions,  they 
will  improve  the  reliability  of  the  rooftop  selection 
process. 

4  A  Review  of  Learning  Techniques 

We  can  formulate  the  task  of  acquiring  rooftop  se¬ 
lection  heuristics  in  terms  of  supervised  learning.  In 
this  process,  training  cases  of  some  concept  are  la¬ 
beled  as  to  their  class.  In  rooftop  selection,  only  two 
classes  exist  —  rooftop  and  non-rooftop  —  which  we 
will  refer  to  as  positive  and  negative  examples  of  the 
rooftop  concept.  Each  instance  consists  of  a  num¬ 
ber  of  attributes  and  their  associated  values,  along 
with  a  class  label.  These  labeled  instances  consti¬ 
tute  training  data  that  are  provided  as  input  to  an 
inductive  learning  routine,  which  generates  concept 
descriptions  designed  to  distinguish  the  positive  ex¬ 
amples  from  the  negative  ones.  These  knowledge 
structures  state  the  conditions  under  which  the  con¬ 
cept,  in  this  case  ‘rooftop’,  is  satisfied. 

For this study, we selected five well-known learning
methods: Quinlan’s [1993] C4.5, Clark and Niblett’s
[1989] CN2, nearest neighbor [e.g., Aha et al., 1992],
naive Bayes [e.g., Langley et al., 1992], and Perceptron
learning [e.g., Zurada, 1992]. We chose these
methods  because  they  represent  a  range  of  represen¬ 
tations,  performance  schemes,  and  learning  mecha¬ 
nisms  for  supervised  concept  learning.  In  addition, 
they  exhibit  different  inductive  biases,  meaning  that 
the  algorithms  acquire  certain  concepts  more  eas¬ 
ily  than  others.  As  a  result,  they  should  provide 
insights  about  the  types  of  machine  learning  algo¬ 
rithms  that  are  useful  for  the  rooftop  selection  task. 

C4.5  [Quinlan,  1993]  constructs  a  ‘decision  tree’  from 
training  data.  Each  nonterminal  node  in  such  a  tree 
specifies  an  attribute,  and  the  emanating  links  indi¬ 
cate  a  value  (or  range  of  values),  whereas  terminal 
nodes  specify  a  class  name.  Classification  occurs  by 
sorting  an  instance  downward  through  the  tree  until 
it reaches a terminal node. Algorithms for inducing
decision trees operate by selecting the attribute
whose values best discriminate among the classes,
partitioning the training data into subsets for each
value (or range), then applying the process in turn
to each of the resulting subsets. This recursive
partitioning process divides the training data into ever
smaller sets, until each set contains only one class or
no further splits are possible. Some variants, C4.5
among them, add a ‘pruning’ stage that cuts back the
tree after its construction.
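
The recursive partitioning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Quinlan's C4.5: it scores binary threshold splits on continuous attributes with plain information gain, omits pruning, and stops when no split yields positive gain. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, attribute index, threshold) of the best binary split."""
    base = entropy(labels)
    best = (0.0, None, None)
    for a in range(len(rows[0])):
        values = sorted({r[a] for r in rows})
        # Candidate thresholds lie midway between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if gain > best[0]:
                best = (gain, a, t)
    return best

def build_tree(rows, labels):
    """Recursively partition until a subset is pure or cannot be split."""
    gain, a, t = best_split(rows, labels)
    if a is None:  # pure subset or no useful split: make a leaf
        return Counter(labels).most_common(1)[0][0]
    left = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
    return (a, t,
            build_tree(*map(list, zip(*left))),
            build_tree(*map(list, zip(*right))))

def classify(tree, row):
    """Sort an instance down the tree until it reaches a leaf."""
    while isinstance(tree, tuple):
        a, t, left, right = tree
        tree = left if row[a] <= t else right
    return tree
```

Internal nodes are stored as (attribute, threshold, left, right) tuples and leaves as class names, so class labels are assumed not to be tuples themselves.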

CN2  [Clark  and  Niblett,  1989,  Clark  and  Boswell, 
1991]  learns  a  set  of  conjunctive  rules  from  training 
examples.  To  classify  an  unknown  instance,  CN2 
determines  which  rules  the  instance  satisfies.  When 
rules  from  different  classes  match  an  instance,  the 
system  uses  a  probabilistic  conflict  resolution  scheme 
to  select  the  single  best  rule.  CN2  uses  a  cover¬ 
ing  algorithm,  much  like  AQ  [Michalski,  1969],  that 
constructs  rules  one  at  a  time.  It  specializes  a  maxi¬ 
mally  general  description  until  it  finds  a  “best”  rule, 
as  determined  by  some  evaluation  criterion.  CN2 
removes  those  training  examples  covered  by  the  rule 
and  repeats  this  process  with  the  remaining  exam¬ 
ples,  creating  additional  rules  until  all  examples  have 
been  covered.  The  system  copes  with  continuous 
data  by  dividing  the  continuous  range  into  discrete 
sub-intervals. 

Another approach is the nearest neighbor method
[e.g., Aha et al., 1991], which uses an ‘instance-based’
representation of knowledge that simply retains
training cases in memory. This approach classifies
new instances by finding the ‘nearest’ stored
case,  as  measured  by  some  distance  metric,  then 
predicting  the  class  associated  with  that  case.  For 
numeric  attributes,  a  common  metric  (which  we 
also  use  in  our  studies)  is  Euclidean  distance.  In 
this  framework,  learning  involves  nothing  more  than 
storing  each  training  instance,  along  with  its  asso¬ 
ciated  class.  Although  this  method  is  quite  simple 
and  has  known  sensitivity  to  irrelevant  attributes, 
in  practice  it  performs  well  in  many  domains.  Some 
versions  select  the  k  closest  cases  and  predict  the  ma¬ 
jority  class;  here  we  will  focus  on  the  ‘simple’  nearest 
neighbor  scheme,  which  uses  only  the  nearest  case. 
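
As a minimal sketch of the ‘simple’ nearest neighbor scheme just described, assuming numeric attribute vectors and Euclidean distance, classification reduces to a single search over the stored cases. The function name and data layout are illustrative choices, not taken from any particular implementation.

```python
import math

def nearest_neighbor(train, query):
    """Predict the class of the stored case closest to `query`.

    `train` is a list of (attribute_vector, label) pairs; "learning"
    is nothing more than keeping this list in memory.
    """
    _, label = min(train, key=lambda case: math.dist(case[0], query))
    return label
```

Because every attribute contributes equally to the distance, an irrelevant attribute with a large range can dominate the result, which is the sensitivity noted above.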

A  fourth  alternative  is  the  naive  Bayesian  classifier 
[e.g.,  Langley  et  al.,  1992],  which  stores  a  proba¬ 
bilistic  concept  description  for  each  class.  This  de¬ 
scription  includes  an  estimate  of  the  class  probability 
and  the  estimated  conditional  probabilities  of  each 
attribute  value  given  the  class.  The  method  classi¬ 
fies  new  instances  by  computing  the  probability  of 
each  class  using  Bayes’  rule,  combining  the  stored 
probabilities  by  assuming  that  the  attributes  are  in¬ 
dependent  given  the  class  and  predicting  the  class 
with  the  highest  probability.  Like  nearest  neighbor, 
naive  Bayes  has  known  limitations,  such  as  sensi¬ 
tivity  to  attribute  correlations  and  an  inability  to 
represent  multiple  decision  regions,  but  in  practice 
it  behaves  well  on  many  natural  domains. 
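
The naive Bayesian classifier can be sketched in a few lines. The version below is only an illustration: it assumes discrete attribute values (continuous features such as those used later would first need discretizing or a density model), and it adds Laplace smoothing, which the description above does not mention. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(instances, labels):
    """Collect the counts needed for class priors and P(value | class)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    domains = defaultdict(set)            # attribute index -> observed values
    for x, y in zip(instances, labels):
        for i, v in enumerate(x):
            value_counts[(y, i)][v] += 1
            domains[i].add(v)
    return class_counts, value_counts, domains

def predict(model, x):
    """Return the class with the highest (log) posterior under independence."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)     # log prior
        for i, v in enumerate(x):
            # Laplace smoothing (an assumption) avoids zero probabilities.
            p = (value_counts[(c, i)][v] + 1) / (n_c + len(domains[i]))
            score += math.log(p)
        scores[c] = score
    return max(scores, key=scores.get)
```

Summing log probabilities rather than multiplying raw probabilities is a numerical convenience; it does not change which class scores highest.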

Figure 1: Visualization interface for roof hypotheses.


Finally, the Perceptron learning algorithm [e.g.,
Zurada, 1992] finds coefficients and a threshold for a
linear  discriminant  function.  The  algorithm  learns 
these  values  by  applying  a  hill-climbing  technique 
until  it  minimizes  the  error  between  the  desired  out¬ 
put  and  the  actual  output  of  the  linear  discriminant 
function.  To  classify  unknown  instances,  the  algo¬ 
rithm  simply  computes  the  weighted  sum  using  the 
learned  coefficients  and  the  attribute  values.  If  the 
weighted  sum  exceeds  the  threshold,  then  the  system 
assigns  the  instance  to  one  class;  otherwise,  it  assigns 
the  instance  to  the  other  class.  No  modifications  are 
needed  to  handle  continuous  data  (cf.  CN2).  This 
algorithm  is  well-studied  and  learns  the  same  type  of 
classifier  as  the  Lin/Nevatia  [1996]  rule,  which  also 
takes  the  form  of  a  linear  discriminant  function. 
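
The following sketch shows one common form of the Perceptron update rule for a two-class problem. It is an assumption-laden illustration rather than the implementation used in these experiments: labels are coded as 1 and 0, the learning rate and epoch limit are arbitrary, and training stops early once an epoch produces no mistakes.

```python
def train_perceptron(rows, labels, epochs=100, rate=1.0):
    """Learn weights and a threshold for a linear discriminant.

    `labels` holds 1 (positive) or 0 (negative).
    """
    weights = [0.0] * len(rows[0])
    threshold = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(rows, labels):
            total = sum(w * v for w, v in zip(weights, x))
            output = 1 if total > threshold else 0
            error = target - output
            if error:
                mistakes += 1
                # Move the boundary toward classifying x correctly.
                weights = [w + rate * error * v for w, v in zip(weights, x)]
                threshold -= rate * error
        if mistakes == 0:   # every training case is classified correctly
            break
    return weights, threshold

def classify(weights, threshold, x):
    """Assign class 1 if the weighted sum exceeds the threshold, else 0."""
    return 1 if sum(w * v for w, v in zip(weights, x)) > threshold else 0
```

When the classes are not linearly separable, the loop simply runs for the full number of epochs and returns the last weights it reached.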

5  Representation  and  Labeling  of 
Rooftop  Hypotheses 

To  apply  the  above  algorithms  to  the  problem  of 
roof  hypothesis  selection,  we  selected  two  data  sets 
derived  from  aerial  images  of  Fort  Hood,  Texas.  The 
site  contains  29  actual  buildings.  The  first  data  set, 
derived  from  image  FHOV1027,  consists  of  1179  roof 
hypotheses  generated  from  a  nadir  view  of  the  site; 
in  contrast,  the  second  data  set,  derived  from  image 
FHOV625,  contains  2193  hypotheses  produced  from 
an  oblique  view  taken  at  a  different  time.  Each  hy¬ 
pothesis  in  the  two  data  sets  is  described  in  terms  of 
nine  features  that  summarize  the  evidence  gathered 
from  lower  levels  of  analysis;  these  include  evalua¬ 
tion  of  edge  support,  corner  support,  parallel  sup¬ 
port,  orthogonal  trihedral  vertex  support,  shadow 
corner  support,  gap  overlap,  displacement  of  edge 
support,  crossing  lines,  and  existence  of  junctions. 
All  nine  features  take  on  continuous  values. 


Before  we  could  pass  the  hypotheses  to  a  learning 
algorithm,  we  first  had  to  label  each  one  as  either  a 
positive  or  negative  example  of  the  desired  concept. 
To  accomplish  this  task  easily,  we  implemented  a 
visualization  system,  shown  in  Figure  1,  that  dis¬ 
plays  each  roof  hypothesis  and  lets  the  user  label 
it  as  positive  or  negative.  There  are  two  problems 
with  this  approach.  First,  the  user  may  have  to  la¬ 
bel  thousands  of  hypotheses,  which  would  be  time- 
consuming  and  tedious.  To  reduce  the  number  of 
hypotheses  a  human  has  to  label,  we  implemented 
a  simple  pre-screening  algorithm  that  takes  user- 
identified  regions  of  interest  (i.e.,  areas  surrounding 
buildings)  and  determines  how  many  corners  of  a 
rooftop  hypothesis  fall  within  this  region.  For  ex¬ 
ample,  if  the  user  specified  that  two  corners  must 
fall  within  the  region,  then  those  hypotheses  with 
fewer  than  two  corners  are  labeled  automatically  as 
negative,  and  the  remaining  hypotheses  are  passed 
to  the  visualization  system  for  human  classification. 
For  image  FHOV1027,  the  pre-screening  step  re¬ 
duced  the  number  of  hypotheses  that  required  la¬ 
beling  from  1179  to  257. 
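
The pre-screening step can be sketched as below. The sketch assumes that each region of interest is an axis-aligned rectangle and that each hypothesis is a list of four corner coordinates; neither detail is specified above, and the names are hypothetical.

```python
def corners_inside(corners, region):
    """Count hypothesis corners that fall inside one rectangular region.

    `region` is (x_min, y_min, x_max, y_max); `corners` is a list of (x, y).
    """
    x_min, y_min, x_max, y_max = region
    return sum(x_min <= x <= x_max and y_min <= y <= y_max
               for x, y in corners)

def prescreen(hypotheses, regions, min_corners=2):
    """Split hypotheses into auto-negatives and those needing a human label."""
    auto_negative, needs_label = [], []
    for corners in hypotheses:
        best = max((corners_inside(corners, r) for r in regions), default=0)
        if best >= min_corners:
            needs_label.append(corners)
        else:
            auto_negative.append(corners)
    return auto_negative, needs_label
```

Raising `min_corners` sends fewer hypotheses to the user but increases the chance that a partially occluded roof is rejected automatically.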

The  second  problem  is  that  the  visualization  sys¬ 
tem  displays  hypotheses  in  the  visual  space  and  not 
in  the  attribute  space.  Early  experiences  with  the 
visualization  system  showed  it  was  difficult  to  judge 
the  quality  of  hypotheses  simply  by  their  visual  char¬ 
acteristics  (i.e.,  the  set  of  four  lines  on  the  image). 
Consequently,  we  needed  a  feedback  mechanism  to 
show  how  the  hypothesis  looked  in  the  visual  and 
attribute  spaces. 

Table 1: Experimental results for within-image learning using data from two Fort Hood images.

                    (a) Image FHOV1027                      (b) Image FHOV625
                    Positive    Negative    Overall         Positive    Negative    Overall
                    Accuracy    Accuracy    Accuracy        Accuracy    Accuracy    Accuracy
Lin/Nevatia         48.80±1.4   100.00±0.0  82.91±0.6       69.64±1.8   100.00±0.0  91.58±0.3
Naive Bayes         74.56±1.5   83.84±0.7   82.28±0.5       59.72±1.6   91.40±0.7   87.82±0.6
Nearest neighbor    60.08±1.9   93.00±0.6   87.50±0.5       54.32±1.9   94.00±0.3   89.63±0.4
C4.5 (w/pruning)    58.12±3.7   94.00±0.5   87.95±0.5       39.24±1.8   96.56±0.6   90.28±0.4
CN2                 47.84±1.4   98.44±0.2   89.99±0.4       0.00±0.0    100.00±0.0  89.20±0.3
Perceptron          0.00±0.0    100.00±0.0  83.20±0.5       1.60±0.9    100.00±0.0  89.37±0.3

To address this problem, we incorporated a simple
learning algorithm into the visualization system that
uses a nearest neighbor classifier and its past
experience to classify a new hypothesis. The system
displays hypotheses classified to be roofs as green
rectangles, non-roofs as blue rectangles, and atypical
hypotheses as red rectangles. The user can set a
“sensitivity threshold” that affects how distant from
previous instances a hypothesis must be before it is
labeled atypical. As the visualization system gains
experience, it displays fewer and fewer atypical
hypotheses. After the system classifies and displays
a hypothesis, the user either confirms or overrides
the system’s decision. By the end of the labeling
session, the user typically confirms most of the
decisions made by the system.
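
One plausible reading of the labeling aid is sketched here: the system proposes the label of the nearest previously confirmed hypothesis unless that neighbor lies farther away than the sensitivity threshold, in which case the hypothesis is flagged as atypical. The distance test, the label strings, and the function name are assumptions made for illustration.

```python
import math

def suggest_label(labeled, query, sensitivity):
    """Suggest 'roof', 'non-roof', or 'atypical' for a new hypothesis.

    `labeled` holds (attribute_vector, label) pairs confirmed so far.
    """
    if not labeled:
        return "atypical"
    vector, label = min(labeled, key=lambda case: math.dist(case[0], query))
    # Too far from anything seen before: defer entirely to the user.
    return label if math.dist(vector, query) <= sensitivity else "atypical"
```

As confirmed cases accumulate, more of the attribute space lies within the threshold of some stored case, which is consistent with the system displaying fewer atypical hypotheses over time.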

6  Within-image  Learning 

A  typical  machine  learning  experiment  manipulates 
one  or  more  independent  variables  and  evaluates 
the  effect  of  this  manipulation  by  measuring  one 
or more dependent variables. Since our hypothesis
was that machine learning would produce classifiers
that perform as well as or better than handcrafted
knowledge, the natural independent variable was
the classifier used to label hypotheses, in particular
whether one used the Lin/Nevatia heuristic or
a learned classifier. Because we were also interested
in the behavior of different methods, we compared
the Lin/Nevatia scheme to all the learning methods
described in Section 4. The obvious dependent
variable is overall accuracy of the learned knowledge on
unseen instances, computed as the percentage of
correctly classified test instances. However, discarding
a hypothesis that corresponds to an actual rooftop
is more costly than retaining a hypothesis that does
not. Consequently, we also measured accuracy
separately on the positive and negative instances.
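
The three dependent measures can be computed as in the sketch below, which assumes boolean labels (True for rooftop) and at least one test instance of each class. The function name is illustrative.

```python
def accuracies(actual, predicted):
    """Return (positive, negative, overall) accuracy as percentages.

    Labels are booleans: True for rooftop, False for non-rooftop.
    Raises ZeroDivisionError if either class is absent from `actual`.
    """
    pos = [p for a, p in zip(actual, predicted) if a]
    neg = [p for a, p in zip(actual, predicted) if not a]
    positive = 100.0 * sum(pos) / len(pos)                  # rooftops kept
    negative = 100.0 * sum(not p for p in neg) / len(neg)   # non-rooftops rejected
    correct = sum(a == p for a, p in zip(actual, predicted))
    overall = 100.0 * correct / len(actual)
    return positive, negative, overall
```

When negatives greatly outnumber positives, overall accuracy is dominated by the negative class, which is why the per-class figures are reported alongside it.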

Our first study took the form of a common type
of experiment found in the machine learning
literature and involved forming multiple partitions of a
data set into training set/testing set pairs. For a
given pair, one applies the learning algorithm to the
training set and evaluates the acquired knowledge
on the test set. Averaging results across all pairs
provides a good estimate of the accuracy for a given
algorithm, since it reduces the effect of
unrepresentative samples in the training or testing sets. We
conducted this experiment separately on the data
derived from image FHOV1027 and from image
FHOV625. For each data set, we randomly generated
25 partitions. For each partition, the training set
consisted of 60% of the original data set, while the
testing set consisted of the remaining 40%; we then
applied all of the learning algorithms to each
train/test pair. For this experiment, we used the MLC++
library of machine learning programs [Kohavi et al., 1996].
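
The partitioning protocol can be sketched as follows. The random seed, the rounding of the 60% cut, and the normal-approximation interval (z = 1.96) are assumptions; the text does not say how the confidence intervals were computed, and this sketch does not use MLC++.

```python
import random
import statistics

def random_partitions(n_items, runs=25, train_fraction=0.6, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    cut = int(round(train_fraction * n_items))
    for _ in range(runs):
        order = list(range(n_items))
        rng.shuffle(order)
        yield order[:cut], order[cut:]

def mean_with_interval(scores, z=1.96):
    """Mean and half-width of a normal-approximation 95% interval."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width
```

A typical use would train and test a classifier on each yielded pair, collect the 25 accuracy values, and pass them to `mean_with_interval` to obtain an entry of the form mean ± half-width.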

For  comparison,  we  also  ran  the  Lin/Nevatia  [1996] 
selection  criterion  on  each  test  set  and  computed  its 
average  accuracy.  Because  its  authors  constructed 
this  heuristic  manually,  no  training  was  involved.  Of 
course,  to  the  extent  that  Lin  and  Nevatia  modified 
it  in  response  to  its  behavior,  they  did  ‘train’  their 
heuristic,  but  this  training  was  conducted  using  sep¬ 
arate  images  from  a  ‘model  board’,  which  were  pho¬ 
tographs  of  a  scale-model  site  for  which  ground  truth 
was  known.  This  does  not  provide  the  best  compar¬ 
ison  possible.  However,  our  concern  here  is  not  with 
the  origin  of  their  heuristic  but  with  trying  to  im¬ 
prove  upon  it. 

Table  1  shows  the  results  of  the  within-image  learn¬ 
ing  experiments  using  data  generated  from  images 
FHOV1027  and  FHOV625.  The  table  presents  re¬ 
sults  for  each  image  in  terms  of  average  positive, 
negative,  and  overall  accuracy,  with  95%  confidence 
intervals.  The  highest  accuracies  for  the  positive 
class  and  overall  are  indicated  by  italics.  For  image 
FHOV1027  (Table  la),  naive  Bayes  performed  the 
worst  overall,  but  had  the  best  performance  for  the 
positive  class.  On  the  other  hand,  CN2  performed 
best  overall,  but  performed  poorly  in  terms  of  pos¬ 
itive  accuracy.  For  image  FHOV625  (Table  lb), 
the  Lin/Nevatia  classifier  performed  best  in  terms 
of  overall  accuracy  and  positive  accuracy,  whereas 
naive  Bayes  was  second,  but  only  in  terms  of  posi¬ 
tive  accuracy. 
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Figure  2:  ROC  curve  for  image  FHOV1027. 


To  provide  a  context  for  these  results,  consider  the 
performance of a frequency-based classifier that always
predicts the most frequent class. Given the data
generated from images FHOV1027 and FHOV625,
this rule would always predict a negative instance,
resulting  in  a  positive  accuracy  of  0%,  a  negative 
accuracy  of  100%,  and  an  overall  accuracy  reflecting 
the  percentage  of  negative  instances  in  the  data  set, 
or  87.1%.  An  extreme  bias  toward  positive  accuracy 
would  lead  one  to  always  predict  a  positive  instance, 
resulting  in  a  positive  accuracy  of  100%,  a  negative 
accuracy  of  0%,  and  an  overall  accuracy  reflecting 
the  percentage  of  positive  instances  in  the  data  set, 
or  12.9%.  The  Lin/Nevatia  rule,  naive  Bayes,  and 
nearest  neighbor  all  find  comfortable  trade-offs  be¬ 
tween  positive  and  negative  accuracy.  These  results 
are  roughly  consistent  with  our  hypothesis,  but  they 
are  hardly  conclusive. 
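
The baseline figures above follow from weighting each class accuracy by that class's share of the data. The short sketch below makes the arithmetic explicit, using the 87.1% negative share quoted in the text; the function name is illustrative.

```python
def overall_accuracy(pos_acc, neg_acc, neg_fraction):
    """Overall accuracy (%) implied by per-class accuracies and class mix."""
    return (1.0 - neg_fraction) * pos_acc + neg_fraction * neg_acc

# Always predict negative: 0% on positives, 100% on negatives.
always_negative = overall_accuracy(0.0, 100.0, 0.871)
# Always predict positive: 100% on positives, 0% on negatives.
always_positive = overall_accuracy(100.0, 0.0, 0.871)
```

Up to floating-point rounding, the two values are 87.1 and 12.9, matching the percentages given for the two trivial classifiers.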

An  important  thing  to  note  about  the  results  on 
these  two  images  is  how  the  relative  position  of 
the  machine  learning  methods  stayed  essentially  the 
same.  The  one  exception,  on  the  second  image,  was 
that  CN2  performed  slightly  worse  than  the  Percep- 
tron  learning  algorithm.  This  consistency  is  encour¬ 
aging  because  it  suggests  that  we  can  identify  a  sin¬ 
gle  algorithm  that  will  perform  well  across  a  range 
of  images.  Hopefully,  experiments  with  more  images 
will  demonstrate  the  same  trend. 

7  The  Cost  of  Misclassification 

Given  the  above  results,  naive  Bayes  and  nearest 
neighbor  show  promise  of  being  comparable  to  the 
Lin/Nevatia  rule  in  terms  of  accuracy  on  the  positive 
instances. Ideally, we would like to improve upon
their positive accuracy without losing much negative
accuracy. To achieve this effect, the learning
algorithm must be biased in favor of positive accuracy,
but most machine learning methods do not provide
ways to accomplish this. Pazzani et al. [1994] have
done  some  preliminary  work  along  these  lines,  which 
they  describe  as  addressing  differing  costs  of  error 
types.  The  basic  idea  is  to  change  the  way  the  algo¬ 
rithm  treats  positive  instances  relative  to  negatives, 
either  during  the  learning  process  or  at  the  time  of 
testing. 

This  approach  should  also  give  us  more  principled 
comparisons  among  the  various  learning  methods. 
By  systematically  varying  the  relative  costs,  we  can 
generate  a  Receiver  Operating  Characteristic  (ROC) 
curve,  which  graphs  negative  accuracies  as  a  func¬ 
tion  of  positive  accuracies.  The  ROC  curve  for  each 
algorithm  provides  a  cost-independent  summary  of 
its  behavior,  with  curves  that  cover  larger  areas  be¬ 
ing  generally  better.  This  suggests  a  revision  of 
our  hypothesis  from  the  previous  section:  as  one 
varies  error  costs,  machine  learning  will  produce 
ROC  curves  with  equal  or  larger  areas  than  that 
covered  by  the  Lin/Nevatia  classifier. 
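
The area comparison can be sketched as below, under two assumptions that the text does not state: accuracies are expressed as fractions in [0, 1], and each curve is anchored at the two trivial classifiers (always negative and always positive) before applying the trapezoid rule. The function name is hypothetical.

```python
def roc_area(points):
    """Area under a curve of negative accuracy versus positive accuracy.

    `points` holds (positive_accuracy, negative_accuracy) pairs in [0, 1],
    one per cost setting.
    """
    # Anchor the curve at the trivial classifiers (an assumption):
    # always-negative (0, 1) and always-positive (1, 0).
    anchored = set(points) | {(0.0, 1.0), (1.0, 0.0)}
    # Sort by positive accuracy; break ties with higher negative accuracy first.
    curve = sorted(anchored, key=lambda p: (p[0], -p[1]))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid rule
    return area
```

With these conventions, a curve through only the two anchors has area 0.5, and a classifier that reaches 100% on both classes has area 1.0.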

To  test  this  hypothesis,  we  implemented  or  obtained 
cost  sensitive  versions  of  the  best  performing  algo¬ 
rithms  from  the  previous  experiments,  namely  naive 
Bayes,  nearest  neighbor,  and  C4.5.  We  defined  a 
cost  on  the  range  [-1.0,  1.0],  where  a  negative  set¬ 
ting  means  that  mistakes  on  the  positive  class  are 
more  costly  and  a  positive  setting  means  that  mis¬ 
takes  on  the  negative  class  are  more  costly.  Although 
this  formulation  differs  slightly  from  Pazzani  et  al.’s 
[1994],  it  is  equivalent  for  two-class  problems. 

Figure 3: ROC curve for image FHOV625.

For naive Bayes, we used the cost sensitive measure
to adjust the Bayesian posterior probabilities [Duda
and Hart, 1973, Pazzani et al., 1994]. Specifically,
the modified algorithm alters the posterior probability
of the preferred class so that it becomes more
probable. The class whose posterior probability is
modified, and the degree to which it is changed, depend
on the sign and magnitude of the cost metric.
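
The exact adjustment is not given here, so the following sketch shows only one way such a rule could work. It assumes a linear reweighting of the two posteriors by a cost in [-1, 1], chosen so that a cost of 0 recovers the ordinary decision and the extremes force a single class; the function name is hypothetical.

```python
def cost_sensitive_decision(p_positive, cost):
    """Choose a class from a posterior P(positive | x) and a cost in [-1, 1].

    cost < 0: errors on positives are more costly, so favor 'positive'.
    cost > 0: errors on negatives are more costly, so favor 'negative'.
    The linear weighting is an illustrative assumption.
    """
    if not -1.0 <= cost <= 1.0:
        raise ValueError("cost must lie in [-1, 1]")
    weighted_pos = p_positive * (1.0 - cost)
    weighted_neg = (1.0 - p_positive) * (1.0 + cost)
    return "positive" if weighted_pos >= weighted_neg else "negative"
```

Sweeping `cost` from -1 to 1 and recording the resulting positive and negative accuracies yields one point per setting, which is how a curve like those in the figures could be traced.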

We  introduced  a  cost  sensitive  measure  into  the 
nearest  neighbor  algorithm  by  adjusting  the  distance 
from  an  unknown  instance  to  its  closest  neighbor 
for  each  class,  positive  and  negative.  After  making 
this  adjustment  based  on  the  sign  and  magnitude  of 
the  cost  measure,  the  classification  process  proceeds 
normally,  assigning  the  class  label  of  the  “closest” 
neighbor.  This  modification  also  works  for  versions 
of  the  algorithm  that  consider  more  than  one  neigh¬ 
bor  when  classifying  unknown  instances. 
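
A minimal sketch of this idea follows. The multiplicative scaling of the two distances is an assumption made for illustration, since the form of the adjustment is not specified above; labels are assumed to be the strings 'positive' and 'negative'.

```python
import math

def cost_sensitive_nn(train, query, cost):
    """1-NN with class-dependent distance scaling; cost lies in [-1, 1].

    `train` holds (attribute_vector, label) pairs with labels 'positive'
    or 'negative', and must contain at least one case of each class.
    """
    def closest(label):
        return min(math.dist(x, query) for x, y in train if y == label)

    # cost < 0 shrinks distances to positives; cost > 0 shrinks distances
    # to negatives. The multiplicative form is an illustrative assumption.
    d_pos = closest("positive") * (1.0 + cost)
    d_neg = closest("negative") * (1.0 - cost)
    return "positive" if d_pos <= d_neg else "negative"
```

With a cost of 0 the rule reduces to the ordinary nearest neighbor decision.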

We  were  able  to  make  similar  modifications  to  the 
Lin/Nevatia  classifier  [1996]  for  purposes  of  compar¬ 
ison.  Since  this  classifier  is  a  linear  discriminant 
function,  we  adjusted  the  threshold  such  that  the 
decision  boundary  is  closer  to  the  hypothetical  clus¬ 
ter  of  positive  examples  or  the  cluster  of  negative 
examples.  The  direction  and  degree  to  which  we  ad¬ 
just  the  threshold  is  again  dependent  on  the  sign  and 
magnitude  of  the  cost  measure. 
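
The threshold adjustment can be illustrated as below. The weights, the base threshold, and the `scale` parameter that bounds the shift are all hypothetical; the actual coefficients of the Lin/Nevatia rule are not reproduced here.

```python
def linear_classify(weights, x, threshold, cost=0.0, scale=1.0):
    """Linear discriminant whose threshold moves with a cost in [-1, 1].

    cost < 0 lowers the threshold, so more hypotheses are accepted as roofs;
    cost > 0 raises it. `scale` (hypothetical) sets the largest shift.
    Returns True when the hypothesis is accepted as a roof.
    """
    score = sum(w * v for w, v in zip(weights, x))
    return score > threshold + cost * scale
```

Because only the threshold changes, the orientation of the decision boundary stays fixed and the boundary simply slides toward one cluster or the other.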

We  also  obtained  a  cost  sensitive  version  of  C4.5 
[Grimmer,  1997],  which  takes  a  different  approach  to 
learning  minimum  cost  classifiers  than  ours.  Rather 
than  using  cost  sensitive  measures  in  the  testing 
phase,  it  takes  cost  into  account  during  the  learn¬ 
ing  phase  when  it  prunes  the  decision  tree  [Breiman 
et al., 1984]. Briefly, the pruning algorithm selects
the least costly of three actions for each subtree: do
nothing (i.e., leave the subtree unpruned); replace
the subtree with a leaf node and assign the majority
class label of the subtree to the leaf node; and
replace the subtree with the subtree of one of its
children. Costs for classes in this version of C4.5 are
expressed on the integer range [0, ∞].

To  investigate  the  effect  of  misclassification  costs, 
we  conducted  an  experiment  using  the  cost  sensi¬ 
tive  versions  of  nearest  neighbor,  naive  Bayes,  C4.5, 
and  the  Lin/Nevatia  classifier.  One  condition  used 
data  from  image  FHOV1027,  while  another  used 
data  from  image  FHOV625.  For  each  algorithm  and 
image,  we  varied  the  cost  metric  and  measured  the 
resulting  positive  and  negative  accuracy  for  ten  runs; 
each  run  involved  partitioning  the  data  set  randomly 
into  training  (60%)  and  testing  (40%)  sets.  For  each 
cost  setting  and  each  classifier,  we  plotted  the  av¬ 
erage  true  negative  rate  (i.e.,  accuracy  on  negative 
cases)  against  the  true  positive  rate  (i.e.,  accuracy 
on  positive  cases).  Figures  2  and  3  show  the  ROC 
curves  that  resulted  from  this  procedure  for  images 
FHOV1027  and  FHOV625,  respectively. 

The  most  notable  aspect  of  Figure  2  is  that  the 
curves  for  most  of  the  learned  classifiers  are  nearly 
identical  with  that  for  the  Lin/Nevatia  rule,  giving 
support  for  our  hypothesis  that  machine  learning  can 
at  least  match  handcrafted  knowledge.  However, 
portions  of  the  ROC  curve  for  cost  sensitive  near¬ 
est  neighbor  stand  out  as  substantially  higher  than 
others,  suggesting  that  this  method  outperforms 
both the Lin/Nevatia rule and the alternative learning
schemes.  This  difference  was  not  apparent  from  our 
earlier  experiment,  and  shows  the  clear  advantage 
of  using  ROC  curves  to  compare  methods  in  cost 
sensitive  domains. 

Inspection of Figure 3 reveals additional support for
our basic hypothesis, since again the Lin/Nevatia
curve runs the same course as those for most learned
classifiers. But for this image, nearest neighbor
does no better than the other induction algorithms,
and portions of the C4.5 curve appear substantially
worse. Clearly, we should replicate these results on
more images, but the results so far are generally
encouraging.

Table 2: Experimental results for between-image learning using data from two Fort Hood images.

                    (a) Train FHOV1027/Test FHOV625         (b) Train FHOV625/Test FHOV1027
                    Positive    Negative    Overall         Positive    Negative    Overall
                    Accuracy    Accuracy    Accuracy        Accuracy    Accuracy    Accuracy
Lin/Nevatia         68.00±0.0   100.00±0.0  91.52±0.0       50.00±0.0   100.00±0.0  83.12±0.0
Naive Bayes         33.72±0.5   98.12±0.1   91.32±0.1       86.44±0.7   64.12±1.4   67.88±1.0
Nearest neighbor    25.96±0.8   97.16±0.2   89.48±0.2       71.36±1.3   80.60±0.5   79.12±0.3
C4.5 (w/pruning)    24.56±1.5   98.00±0.3   89.98±0.2       58.60±2.5   84.40±0.9   80.10±0.7
CN2                 60.00±0.0   100.00±0.0  89.60±0.0       0.00±0.0    100.00±0.0  83.30±0.0
Perceptron          0.00±0.0    100.00±0.0  89.15±0.0       1.28±0.8    100.00±0.0  83.47±0.1

8  Between-image  Learning 

We  geared  our  final  study  more  toward  the  goals 
of  image  analysis.  Recall  that  our  motivating  prob¬ 
lem  is  the  large  number  of  images  that  must  be  pro¬ 
cessed.  In  order  to  alleviate  the  burden  on  the  image 
analyst,  we  want  to  apply  knowledge  learned  from 
some  images  to  many  other  images.  We  have  already 
noted  that  several  dimensions  of  variation  pose  prob¬ 
lems  to  transferring  learned  knowledge  to  new  im¬ 
ages.  For  example,  one  viewpoint  of  a  given  site 
can  differ  from  other  viewpoints  of  the  same  site  in 
both  orientation  and  angle  from  the  perpendicular. 
We  need  to  better  understand  how  the  knowledge 
learned  from  one  image  generalizes  to  other  images 
that  differ  along  these  dimensions.  Images  taken  at 
different  times  and  images  of  different  sites  present 
similar  issues.  Our  hypothesis  here  was  a  stronger 
version  of  earlier  ones:  classifiers  learned  from  one 
image would perform as well as or better on unseen
images than handcrafted classifiers. However, we also
expected  that  such  between-image  learning  would 
give  less  impressive  results  than  the  within-image 
situation. 

To  test  our  predictions,  we  simply  examined  gener¬ 
alization  across  the  two  images,  which  differ  both 
in  viewing  angle  and  in  time.  Clearly,  future  ex¬ 
periments  should  systematically  vary  each  of  these 
independent  variables,  to  determine  their  individual 
effect  on  transfer,  but  the  current  comparison  should 
give  us  some  insight  into  how  well  learned  knowledge 
carries across images. For each learning algorithm,
we conducted 25 runs in which 60% of the hypotheses
from image FHOV1027 served as the training
set and all hypotheses from image FHOV625 as the
test set. In addition, we carried out the same procedure
using 60% of the data from image FHOV625 for
training and all of the data from image FHOV1027
for testing. For this experiment, we again took
advantage of the MLC++ library of machine learning
programs [Kohavi et al., 1996].

The  results  for  this  between-image  experiment  ap¬ 
pear  in  Table  2.  For  the  first  condition  (Table  2a), 
the  Lin/Nevatia  rule  performed  the  same  as  in  the 
second within-image learning condition (Table 1b),
since  the  same  data  was  used  for  testing.  Recall 
that  no  comparable  notion  of  generalization  applies 
to  the  Lin/Nevatia  classifier  since  it  was  handcrafted 
for  unrelated  data.  The  overall  performance  of  naive 
Bayes  was  very  high,  but  its  predictive  accuracy 
on  the  positive  class  was  much  less  than  for  CN2 
and  the  Lin/Nevatia  classifier.  CN2  achieved  a  re¬ 
spectable  accuracy  on  the  positive  examples  for  this 
problem,  in  contrast  to  its  performance  in  the  ex¬ 
periment  on  within-image  learning. 

For  the  second  between-image  condition  (Table  2b), 
the  Lin/Nevatia  rule  again  performed  as  it  did  in 
the first within-image condition (Table 1a), since
the  same  data  was  used  for  testing.  Although  the 
Perceptron  algorithm  achieved  the  highest  overall 
accuracy,  CN2  and  the  Lin/Nevatia  rule  performed 
nearly  as  well.  In  terms  of  positive  accuracy,  naive 
Bayes  and  nearest  neighbor  performed  highest;  yet 
in  terms  of  overall  accuracy,  these  two  methods  fared 
the  worst. 

However,  evaluating  our  hypothesis  about  between- 
image  generalization  involves  comparing  with  accu¬ 
racies  from  the  within-image  study,  and  here  the 
results are mixed. Table 1b reports performance
when  we  trained  and  tested  the  learning  methods 
on  image  FHOV625,  whereas  Table  2a  reports  per¬ 
formance  for  training  on  FHOV1027  and  testing  on 
FHOV625.  As  predicted,  the  positive  accuracies 
for  naive  Bayes,  nearest  neighbor,  and  C4.5  in  the 
between-image case are lower than for the within-image
condition, suggesting considerably less generalization
across images than within them.

But, comparing Table 1a (training and testing on
FHOV1027)  with  Table  2b  reveals  a  different  story. 
Here  the  positive  accuracies  are  substantially  higher 
for  the  between-image  condition,  suggesting  that 
generalization  was  actually  better  across  images 
than  within  them.  This  finding  is  more  encourag¬ 
ing  but  runs  counter  to  our  expectation  that  train¬ 
ing  and  testing  on  different  images  would  be  a  more 
difficult  learning  task  than  training  and  testing  on 
the  same  image.  The  results  are  further  confused 
by  CN2’s  behavior,  which  showed  higher  accuracies 
than  expected  in  Table  2a  but  not  in  Table  2b. 
Clearly,  the  results  from  this  experiment  are  in¬ 
conclusive,  and  we  suspect  that  both  the  large  de¬ 
crease  in  FHOV1027  accuracies  and  the  increase  in 
FHOV625  accuracies  are  artifacts  due  to  the  particu¬ 
lar  distribution  of  positive  and  negative  instances  in 
these  data  sets.  Repeating  the  study  using  cost  sen¬ 
sitive  versions  of  the  learning  algorithms,  and  calcu¬ 
lating  ROC  curves  for  each  pair  of  training  and  test 
images,  should  reveal  the  generalization  for  each  case 
in  a  distribution-free  manner.  In  this  framework,  we 
would  expect  that  the  area  under  the  ROC  curve  for 
between-image  learning  on  a  given  test  image  and  a 
given  method  will  be  less  than  the  area  for  within- 
image  learning  on  the  same  image  and  method. 

In  summary,  our  experiments  have  revealed  some 
factors  that  influence  performance  on  the  rooftop 
selection  task  —  the  learning  algorithm  used  to  ac¬ 
quire  knowledge,  the  relative  cost  of  classification 
errors,  and  the  nature  of  the  images  themselves. 
We  showed  that,  at  least  for  positive  accuracies, 
the  naive  Bayesian  and  nearest  neighbor  classifiers 
closely approach (and sometimes exceed) the handcrafted
Lin/Nevatia heuristic. Although it is difficult
to identify the conditions under which generalization
occurs between images, we have available a methodology
that will help us investigate this issue further.

9  Related  Work  on  Visual  Learning 

Research  on  learning  in  computer  vision  has  become 
increasingly  common  in  recent  years.  Papers  by 
Conklin  [1993],  Sengupta  and  Boyer  [1993],  Cook 
et  al.  [1993],  and  Provan  et  al.  [1996]  all  describe 
approaches  to  learning  three-dimensional  descrip¬ 
tions  for  use  in  object  recognition.  Another  ap¬ 
proach  [e.g.,  Gros,  1993,  Pope  and  Lowe,  1996]  in¬ 
stead  learns  characteristic  views  for  use  in  recogni¬ 
tion,  while  still  others  focus  on  learning  the  appear¬ 
ances  of  objects  or  scenes  [e.g.,  Nayar  et  al.,  1996, 
Pomerleau,  1996,  Viola,  1993]. 

Most  work  on  visual  learning  ignores  the  importance 
of  misclassification  costs,  but  our  work  along  these 
lines has some precedents. In particular, Draper et
al. [1994] incorporate the cost of errors into their algorithm
for constructing and pruning multivariate
decision  trees.  They  tested  this  approach  on  the 
task  of  labeling  pixels  from  outdoor  images  for  use 
by  a  road-following  vehicle.  They  determined  that, 
in  this  context,  labeling  a  road  pixel  as  non-road  was 
more  costly  than  the  reverse,  and  showed  experimen¬ 
tally  that  their  method  could  reduce  such  errors  on 
novel  test  pixels. 

However,  like  much  of  the  research  on  visual  learn¬ 
ing,  Draper  et  al.’s  work  focused  on  image  processing 
in  complex  scenes  at  eye  level.  One  exception  is  Con¬ 
nell  and  Brady’s  [1987]  work  on  learning  structural 
descriptions  of  airplanes  from  aerial  views.  Their 
method  converted  training  images  into  semantic  net¬ 
works,  which  it  then  generalized  on  comparison  with 
descriptions  of  other  instances.  However,  Connell 
and  Brady  do  not  appear  to  have  tested  experimen¬ 
tally  their  algorithm’s  ability  to  accurately  classify 
objects  in  new  images. 

Draper [1996] reports a more careful study of learning
in the context of analyzing aerial images. His
approach  adapts  methods  for  reinforcement  learning 
to  assign  credit  in  multi-stage  image  processing  (for 
software  similar  to  the  Lin/Nevatia  system),  then 
uses  an  induction  method  (backpropagation  in  neu¬ 
ral  networks)  to  learn  conditions  on  operator  selec¬ 
tion.  He  presents  initial  results  on  a  RADIUS  task 
that  also  involves  the  detection  of  roofs. 

Our  framework  shares  some  features  with  Draper’s 
approach,  but  assumes  that  learning  is  directed  by 
feedback  from  a  human  expert.  We  predict  that 
our  supervised  method  will  be  more  computation¬ 
ally  tractable  than  his  use  of  reinforcement  learning, 
which  is  well  known  for  its  high  complexity.  Our 
approach  does  require  more  interaction  with  users, 
but  we  believe  this  interaction  will  be  unobtrusive 
if  cast  within  the  context  of  an  advisory  system  for 
image  analysis. 

10  Concluding  Remarks 

Although  our  initial  studies  have  provided  some  in¬ 
sights  into  the  role  of  machine  learning  in  image 
analysis,  much  still  remains  to  be  done.  For  ex¬ 
ample,  we  may  want  to  consider  alternate  measures 
of  classification  accuracy  that  take  into  account  the 
presence  of  multiple  valid  hypotheses  for  a  given 
rooftop.  Classifying  one  of  these  hypotheses  cor¬ 
rectly  is  sufficient.  In  addition,  although  the  rooftop 
selection  stage  was  a  natural  place  to  start  in  apply¬ 
ing  our  methods,  we  are  also  interested  in  working 
at  both  earlier  and  later  levels  of  the  process.  Note 
that  the  goal  here  is  not  only  to  increase  classifi¬ 
cation  accuracy,  which  could  be  handled  entirely  by 
hypothesis  selection,  but  to  reduce  the  complexity  of 
processing by removing bad hypotheses before they
are aggregated into larger structures. With this aim
in mind, we plan to extend our work to apply at all
levels of the image understanding process.

We  must  address  a  number  of  issues  before  we  can 
apply  machine  learning  to  other  stages.  One  in¬ 
volves  identifying  the  cost  of  different  errors  at  each 
level,  and  taking  this  into  account  in  our  modified 
induction  algorithms.  Another  concerns  whether  we 
should  use  the  same  induction  algorithm  at  each 
level  or  use  different  methods  at  each  stage.  We 
should  also  explore  using  a  number  of  learning  meth¬ 
ods  in  combination,  either  averaging  their  predic¬ 
tions  (as  in  work  on  ‘ensembles’)  or  cascading  the 
results  (as  in  work  on  ‘boosting’). 

As  we  mentioned  earlier,  in  order  to  automate  the 
collection  of  training  data  for  learning,  we  also  hope 
to  integrate  learning  routines  into  the  Lin/Nevatia 
software.  This  system  was  not  designed  initially  to 
be  interactive,  but  we  would  like  to  modify  it  so  that 
the  image  analyst  can  accept  or  reject  recommen¬ 
dations  made  by  the  image  understanding  system, 
generating  training  data  in  the  process.  At  intervals 
the  system  would  invoke  its  learning  algorithms,  pro¬ 
ducing  revised  knowledge  that  would  alter  the  sys¬ 
tem’s  behavior  in  the  future  and,  hopefully,  reduce 
the  user’s  need  to  make  corrections.  The  interactive 
labeling  system  described  in  Section  5  could  serve  as 
an  initial  model  for  this  interface. 

In  conclusion,  our  studies  suggest  that  machine 
learning  has  an  important  role  to  play  in  improv¬ 
ing  the  accuracy,  and  thus  the  robustness,  of  image 
analysis systems. However, we need additional experiments
to better understand the factors
affecting between-image transfer, and we need
to  extend  learning  to  additional  levels  of  the  image 
understanding  process.  Also,  before  we  can  develop 
a  system  that  truly  aids  the  human  image  analyst, 
we  must  develop  and  implement  unobtrusive  ways 
to collect training data to support learning.

Abstract

The  ability  to  locate  scenes  and  objects  visible  in 
a  video/image  frame  to  their  corresponding  locations 
and  coordinates  in  a  reference  coordinate  system  will 
be  important  in  visually-guided  navigation,  surveillance 
and  monitoring  systems  of  the  future.  Aerial  video  is 
rapidly  emerging  as  a  low  cost,  widely  used  source  of 
imagery  for  mapping,  surveillance  and  monitoring  appli¬ 
cations.  A  database  of  reference  imagery  and  the  as¬ 
sociated  geo-coordinates  (e.g.  latitude/longitude)  is  of¬ 
ten  available  for  locales  that  are  surveyed  using  current 
videos.  However,  a  key  technical  problem  of  locating 
objects  and  scenes  in  a  video  to  their  geo-coordinates 
needs  to  be  solved  in  order  to  ascertain  the  geo-location 
of  objects  seen  from  the  camera  platform’s  current  lo¬ 
cation.  In  this  paper  we  present  the  design  of  a  system 
and  key  algorithms  for  the  problem  of  accurate  mapping 
between  camera  coordinates  and  geo-coordinates,  called 
geo-spatial  registration.  Current  systems  for  geo-location 
use  the  position  and  attitude  information  for  the  moving 
platform  in  some  fixed  world  coordinates  to  locate  the 
video frames in the reference database. However, the accuracy
achieved is only of the order of 10's to 100's of pixels.
Our approach utilizes the imagery and terrain information
contained in the geo-spatial database to precisely
align  dynamic  videos  with  the  reference  imagery  and 
thus  achieves  a  much  higher  accuracy.  Applications  of 
geo-spatial registration include text/graphical/audio annotations
of objects of interest in the current video using
the  stored  annotations  in  the  reference  database.  These 
applications  extend  beyond  aerial  videos  into  the  chal¬ 
lenging  domain  of  video/image-based  map  and  database 
indexing  of  arbitrary  locales,  like  cities  and  urban  areas. 

1  Introduction 

Aerial  video  is  rapidly  emerging  as  a  low  cost, 
widely  used  source  of  imagery  for  mapping,  surveil¬ 
lance  and  monitoring  applications.  Visible  and  in¬ 
frared  (IR)  video  cameras  are  increasingly  deployed 
on  airborne  platforms,  both  manned  and  unmanned, 
to  provide  observers  with  a  real  time  view  of  activity 
and  terrain.  However,  there  remains  a  key  technical 
problem  in  the  use  of  video  from  moving  vehicles: 
determining  how  the  locations  of  objects  in  a  video 
display  relate  to  the  geographic  locations  of  these 
objects  on  the  observed  scene.  This  relationship  be¬ 
tween  image  coordinates  and  geographic  coordinates 
must  be  known  before  actions  can  be  undertaken 


based  on  the  video.  This  mapping  between  camera 
coordinates  and  ground  coordinates,  called  geospa¬ 
tial  registration,  depends  both  on  the  location  and 
orientation  of  the  camera  and  on  the  distance  and 
topology  of  the  ground.  Camera  location  continu¬ 
ally  changes  as  the  airborne  camera  moves.  Rough 
geospatial  registration  can  be  derived  from  camera 
telemetry  data  (GPS  location  of  the  aircraft  and  ori¬ 
entation  of  the  camera)  and  digital  terrain  map  data 
(from  a  database)  [3].  This  form  of  registration  is  the 
best  available  in  fielded  systems  today  but  does  not 
provide  the  precision  needed  for  many  tasks.  Higher 
precision  will  be  achieved  by  correlating  (and  regis¬ 
tering)  observed  video  frames  in  real  time  to  stored 
reference imagery. The reference imagery includes
previously  collected  satellite  images  that  have  been 
precisely  aligned  to  map  coordinates.  An  applica¬ 
tion  of  geospatial  registration  may  be  to  annotate 
objects  observed  in  the  video  by  their  names  or  over¬ 
lay  maps,  boundaries  and  other  graphical  features 
over  the  video  imagery. 

In  this  paper,  we  present  the  design  of  a  sys¬ 
tem  for  accurate  geospatial  registration.  We  then 
present  the  details  and  results  of  some  of  the  key 
algorithms  [7]  we  have  developed  in  our  laboratory 
towards  implementing  the  overall  system.  This  work 
is  in  contrast  with  the  previous  body  of  work  based 
on  still  imagery  exploitation  using  site  models  [9]. 

1.1  Components  of  a  Geo-registration 
System 

A  schematic  of  the  geo-registration  and  anno¬ 
tation  concept  is  shown  in  Figure  1.  The  figure 
shows  a  mobile  platform  capturing  current  videos 
of  a  locale.  After  geo-registration  to  the  reference 
imagery  database,  the  footprints  of  the  video  are 
shown  overlaid  on  the  reference  imagery,  and  lat¬ 
itude/longitude/height  of  points  of  interest  are  re¬ 
trieved  based  on  geo-registration  and  are  overlaid 
on  the  relevant  points  on  the  video  frame. 

Figure 1: A schematic of the Geo-registration concept.

*The work reported here was funded in part by the National
Information Display Laboratory, Princeton, New Jersey.

Figure 2 shows the functional layout of the major
blocks and flow of information in the geospatial
registration indexing, alignment, annotation display,
and synthetic view creation system. The system can
broadly be divided into the following components:

•  Geo-referencing/location module for locating
video frames within an imagery and multi-modal
annotated database that is registered
with geographical coordinates.

•  Annotation  overlay  module  that  accesses  the 
relevant  annotations  using  the  geo-location 
module  and  overlays  these  on  video  frames. 

•  Database  of  reference  imagery,  DEM  (Digital 
Elevation  Map)  and  multi-modal  annotations 
such  as  graphics,  text,  audio,  maps  etc. 

•  New  view  generation  module  that  can  show 
views  of  the  reference  database  from  the  van¬ 
tage  point  of  the  video  capture  platform. 

The engineering support data (ESD: GPS, camera
look angle, etc.) supplied with the video is decoded
to define the camera model (position and attitude)
with respect to the reference database. The camera
model and scene database are used to apply an
image perspective transformation to create a set of
synthetic reference images from the perspective of
the sensor. This set of reference images is fed to
the real-time video processing system to be used for
indexing and fine alignment.^1 Simultaneously, the
system indexes into the geospatial feature database
according to the geo-coordinate footprint of the sensor
and extracts candidate annotations for overlay.
These annotations are then fed to the real-time video
processing system. The real-time system computes
the precise locations for overlaying the annotations
on live video. The annotations are overlaid by the
video mixer on the analyst's reference overview image
monitor window with attached sound/text references
for point and click recall. Finally, the current
estimate of sensor attitude and position is updated
based upon results of matching from the real-time
registration module. This information is used to generate
new reference images to support matching based
upon new estimates of sensor position and attitude,
and the whole process is iterated.

^1 The alignment step does not require accurate knowledge
of the video camera's calibration parameters.

1.2  From  Videos  to  Mosaics  to  Geo- 
registration 

Given  that  ESD  on  its  own  will  not  be  reliable 
in associating objects seen in videos with their corresponding
geo-locations, we utilize the precision in
localization  afforded  by  the  alignment  of  rich  visual 
attributes  typically  available  in  video  imagery.  For 
aerial  surveillance  scenarios,  often  a  reference  im¬ 
age  database  in  geo-coordinates  along  with  the  as¬ 
sociated  DEM  maps  and  annotations  is  available. 
The  visual  features  available  in  the  reference  imagery 
database  are  correlated  with  those  in  video  imagery 
to  achieve  an  order  of  magnitude  improvement  in 
alignment in comparison to purely ESD-based alignment.
Our approach uses the ESD to generate an initial
hypothesis for the geo-located locale of interest
for the current video data. The initial hypothesis is
typically  a  section  of  the  reference  imagery  warped 
and  transformed  so  that  it  approximates  well  the  vi¬ 
sual  appearance  of  the  relevant  locale  from  the  view¬ 
point  specified  by  the  ESD.  Subsequently,  precise 
sub-pixel  alignment  between  the  video  frame  and  ref¬ 
erence  imagery  corresponding  to  the  relevant  locale 
is  used  for  precise  geo-location  of  the  video  frame. 
The  position  and  attitude  information  provided  by 
ESD  may  be  good  enough  to  generate  a  3Kx3K  sec¬ 
tion  of  the  relevant  reference  imagery  with  a  reso¬ 
lution  of  about  1-2  feet  per  pixel.  This  clearly  will 
not  be  adequate  for  overlaying  potentially  complex 
annotations  like  maps  and  object  models.  Accurate 
sustained  overlays  over  time  on  the  video  imagery 
require  constant  maintenance  of  precise  alignment 
between  the  geo-referenced  annotations  and  video 
frames.  The  process  of  precise  alignment  of  video 
to  reference  imagery  itself  needs  to  be  divided  into 
three main steps:

•  Video-to-video frame alignment and mosaic creation.

•  Coarse indexing/pruning of match locations in
the reference imagery to eliminate all but the
best possible location of the video imagery in
the reference, to within a few pixels of its true
location.

•  Precise parametric alignment with and/or without
the DEM map to obtain sub-pixel localization
of the video frame in reference imagery.

•  Video-to-reference tracking based alignment
mode.

Figure 2: Functional Specification Block Diagram (functional layout
of the video indexing, alignment and synthetic fly-through system).

2  Lens  Distortion  Corrected  Video 
Mosaicing 

Video  frames  are  typically  acquired  at  30  frames  a 
second  and  contain  a  lot  of  frame-to-frame  overlap. 
For  typical  altitudes  and  speeds  of  airborne  plat¬ 
forms,  the  overlaps  may  range  from  4/5  to  49/50th 
of  a  single  frame.  Therefore,  conversion  of  video 
frames  to  video  mosaics  is  an  efficient  way  to  keep 
up  with  the  incoming  video  stream.  We  exploit  the 
redundancy  in  video  frames  by  aligning  successive 


video  frames  with  low  order  parametric  transforma¬ 
tions  like  translation,  affine  and  projective  transfor¬ 
mations.  The  frame-to-frame  alignment  parameters 
enable  the  creation  of  a  single  extended  view  mo¬ 
saic  image  that  authentically  represents  all  the  in¬ 
formation  contained  in  the  aligned  input  frames  in 
a  compact  image.  For  instance,  typically  30  frames 
of  standard  NTSC  resolution  (720x480)  containing 
about 10M pixels may be reduced to a single mosaic
image  containing  only  about  200K  to  2M  pixels  de¬ 
pending  on  the  overlap  between  successive  frames. 
The  video  mosaic  is  subsequently  available  for  geo- 
referencing  and  location. 
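
The figures quoted above can be checked with a back-of-the-envelope calculation. The sketch below assumes a deliberately simple model that is not taken from the text: every new frame overlaps its predecessor by a fixed fraction, so it contributes only the non-overlapping remainder to the mosaic. The function names are illustrative.

```python
def raw_pixels(n_frames, width, height):
    """Total pixel count of the individual frames."""
    return n_frames * width * height


def mosaic_pixels(n_frames, width, height, overlap):
    """Rough mosaic size when each new frame shares the fraction
    `overlap` of its area with the previous frame (assumed model)."""
    frame = width * height
    return frame * (1 + (n_frames - 1) * (1 - overlap))


raw = raw_pixels(30, 720, 480)               # one second of NTSC video
high = mosaic_pixels(30, 720, 480, 4 / 5)    # least overlap quoted
low = mosaic_pixels(30, 720, 480, 49 / 50)   # most overlap quoted
```

Under this model one second of video holds about 10.4M pixels, while the mosaic holds roughly 0.5M to 2.4M pixels. That is the same order of magnitude as the range quoted in the text; the simple model is only meant to show where the reduction comes from.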

In  the  work  reported  here,  instead  of  using  affine 
transformations  as  in  [5,  10],  the  frame-to-frame 
alignment  process  has  been  changed  in  two  signif¬ 
icant  ways:  (i)  we  compute  the  projective  transfor¬ 
mation  [14],  and  (ii)  the  mosaicing  process  is  ex¬ 
tended  to  handle  unknown  lens  distortion  present  in 
the  image. 

Figure 3: Projective transformation based video mosaic
for the oblique video.

The affine transform models the image projection
used in creating the aerial video as an orthographic
projection, while the projective transform models
the image projection as a perspective projection.
Since our final goal is to precisely align the video
mosaics with the reference imagery, the video mosaic
is created by using the projective transformation.
An example of such a mosaic is shown in Figure 3.
Note that in the video mosaicing process we do
not take 3D parallax into account. However, in
the overall framework, a new mosaic is constructed
every second or so. In typical scenarios, the 3D parallax
observed in the aerial video over such short time
frames is minimal.
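
The difference between the two transformation models can be seen in a few lines of code. The sketch below is illustrative only: the matrices are made-up values, and `apply_homography` is a hypothetical helper. It shows that an affine map sends the midpoint of a segment to the midpoint of the mapped segment, while a projective map with a non-trivial last row does not.

```python
def apply_homography(H, x, y):
    """Map the point (x, y) through a 3x3 matrix H (nested lists)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


# Made-up example transformations. The affine one has last row (0, 0, 1).
affine = [[1.0, 0.1, 5.0], [0.0, 1.0, -3.0], [0.0, 0.0, 1.0]]
projective = [[1.0, 0.1, 5.0], [0.0, 1.0, -3.0], [0.001, 0.0, 1.0]]


def midpoint_error(H):
    """Distance between the mapped midpoint and the midpoint of the
    mapped endpoints for the segment (0, 0)-(100, 0)."""
    ax, ay = apply_homography(H, 0.0, 0.0)
    bx, by = apply_homography(H, 100.0, 0.0)
    mx, my = apply_homography(H, 50.0, 0.0)
    return ((mx - (ax + bx) / 2) ** 2 + (my - (ay + by) / 2) ** 2) ** 0.5


affine_error = midpoint_error(affine)
projective_error = midpoint_error(projective)
```

The non-zero error in the projective case is the perspective foreshortening that an affine model cannot represent, which matters most for oblique views.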

Often in aerial video streams, the lens distortion
parameters must be explicitly modeled in the estimation
process. A fundamental assumption made
in the earlier work on mosaicing was that one image
could be chosen as the reference image and the
mosaic would be constructed by merging all other
images to this reference image. However, when
lens distortion is present, this is not true. We
extend the direct estimation algorithms to use a reference
coordinate system but not a reference image.
We compute the motion parameters which warp all
images to a virtual image mosaic in this reference
coordinate system. Each pixel in this virtual image
mosaic may be predicted by intensities from more
than one image. The error measure we minimize is
the sum of the variances of predicted pixel intensities
at each pixel location, summed over the virtual
image. In order to compute the correspondences and
the unknown parameters simultaneously, we formulate
an error function that minimizes the variance
in intensities of a set of corresponding points in the
images that map to the same ideal reference coordinate.
Formally, the unknown projective transformation
parameters for each frame, $A^1 \ldots A^N$, and the
lens distortion parameter, $\gamma_1$, are solved for through:

$$\min_{A^1 \ldots A^N,\, \gamma_1} \; \sum_{p} \frac{1}{M(p)} \sum_{i} \left( I_i(p^i) - \bar{I}(p) \right)^2 \qquad (1)$$

where point $p^i$ in frame $i$ is a transformation of a
point $p$ in the reference coordinate system, $\bar{I}(p)$ is
the mean intensity value of all the $p^i$'s that map to
$p$, and $M(p)$ is a count of all such $p^i$'s. Therefore,
given a point $p$ in the reference coordinates, each
term in the sum in Equation 1 is the variance of all
the intensity values at points $p^i$ that map to point $p$.
An example of alignment and mosaic construction
using this process is shown in Figures 4 and 5.
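
To make the error measure concrete, the following sketch evaluates the sum of per-pixel intensity variances for a set of frames that have already been warped into the reference coordinate system. It is a minimal illustration under stated assumptions: each warped frame is represented as a dictionary from reference pixel to intensity, and the function name is hypothetical. The estimation of the warp parameters themselves is not shown.

```python
def mosaic_variance_error(warped_frames):
    """Sum over reference pixels of the variance of the intensities
    that the warped frames predict at that pixel.

    warped_frames: list of dicts {reference_pixel: intensity}; a pixel
    is absent from a dict when that frame does not cover it.
    """
    samples = {}
    for frame in warped_frames:
        for p, value in frame.items():
            samples.setdefault(p, []).append(value)

    total = 0.0
    for values in samples.values():
        m = len(values)                      # M(p): frames covering p
        mean = sum(values) / m               # mean intensity at p
        total += sum((v - mean) ** 2 for v in values) / m
    return total


frame_a = {(0, 0): 10.0, (1, 0): 20.0}
frame_b_aligned = {(1, 0): 20.0, (2, 0): 30.0}
frame_b_misaligned = {(1, 0): 26.0, (2, 0): 30.0}

error_aligned = mosaic_variance_error([frame_a, frame_b_aligned])
error_misaligned = mosaic_variance_error([frame_a, frame_b_misaligned])
```

Pixels covered by a single frame contribute zero, so the measure is driven entirely by the regions where frames overlap. No frame is singled out as the reference, which is the point of the formulation.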


Figure  5:  Lens  distortion  corrected  video  mosaic  of 
three  frames  of  an  aerial  video. 

3  Coarse  Indexing/Matching 

The  video  mosaics  created  at  regular  intervals 
(typically one each second) need to be geo-referenced
with  the  reference  database.  Given  the  size  of  the 
mosaics  and  the  relevant  piece  of  the  reference  im¬ 
agery,  for  real  time  constraints,  the  process  of  locat¬ 
ing  the  video  mosaic  within  the  reference  coordinates 
needs  to  be  divided  into  a  coarse  indexing/matching 
step  and  a  precise  alignment  step.  The  coarse  index¬ 
ing  step  locates  a  video  mosaic  within  a  reference 
image  using  visual  appearance  features.  In  prin¬ 
ciple,  one  could  exhaustively  correlate  the  intensi¬ 
ties  in  the  video  mosaic  and  the  reference  image  at 
each  pixel  and  find  the  best  match.  However,  due 
to  the  uncertainties  in  viewpoint  from  ESD  and  due 
to  real  changes  in  appearance  between  the  reference 
imagery  and  the  current  video,  it  may  not  be  pos¬ 
sible  to  directly  correlate  intensities  in  the  two  im¬ 
ages.  The  real  changes  in  appearance  may  be  due 
to  change  of  reflectance  of  objects  and  surfaces  in 
the  scene  (e.g.  summer  to  fall)  and  due  to  difference 
in  illumination  between  the  reference  and  the  video 
imagery.  Changes  in  appearance  due  to  viewpoint 
are  accounted  for  to  a  large  extent  by  the  process  of 
warping  the  reference  image  to  the  ESD  viewpoint. 
However, for robust matching and localization, indexing
and matching need to be resilient to uncertainties
in ESD and to real changes.

Figure 4: Top: Three frames of an aerial video clip.

We propose to solve this problem by computing
features at multiple scales and multiple orientations
that are invariant or quasi-invariant to changes in
viewpoint. These features are computed at many
salient locations both in the reference and video imagery
[8]. The salient locations are determined automatically
based on distinctiveness of local image
structure. Even with the feature representations
at salient locations only, there may be too much
data for exhaustive matching. Therefore, in the
first step, fast indexing of the multi-dimensional
visual features is used to eliminate most of the false
matches [2, 12]. Subsequently, exhaustive matching
of the small set of remaining candidate matches leads
to the correct coarse location of the video imagery
in the reference coordinates.
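
The two-stage structure described above (fast indexing followed by exhaustive matching of the survivors) can be sketched as follows. The indexing scheme here, a hash table over quantized feature vectors, is an assumption chosen for brevity and is not the method of [2, 12]; all names and feature values are illustrative.

```python
import math
from collections import defaultdict


def quantize(feature, cell):
    """Coarse bucket key: each feature component divided into cells."""
    return tuple(int(math.floor(v / cell)) for v in feature)


def build_index(ref_features, cell):
    """Hash reference feature vectors by their quantized key."""
    index = defaultdict(list)
    for location, feature in ref_features.items():
        index[quantize(feature, cell)].append(location)
    return index


def match(query, ref_features, index, cell):
    """Stage 1: keep only references in the query's bucket.
    Stage 2: compare the query exhaustively with those candidates."""
    candidates = index.get(quantize(query, cell), [])
    best, best_distance = None, float("inf")
    for location in candidates:
        distance = math.dist(query, ref_features[location])
        if distance < best_distance:
            best, best_distance = location, distance
    return best, len(candidates)


# Hypothetical features at three salient reference locations.
ref_features = {
    (10, 20): (0.12, 0.85, 0.40),
    (200, 50): (0.90, 0.10, 0.33),
    (15, 22): (0.18, 0.83, 0.45),
}
index = build_index(ref_features, cell=0.25)
best_location, n_candidates = match((0.13, 0.84, 0.41), ref_features,
                                    index, cell=0.25)
```

A query that falls near a cell boundary can miss its true match with this simple scheme; a practical index would also probe neighbouring cells or use a tree structure.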

4  Fine  Geo-registration 

The  coarse  localization  is  used  to  initialize  the  pro¬ 
cess  of  fine  alignment  while  accounting  for  the  ge¬ 
ometric  and  photometric  transformations  between 
the  video  and  reference  imagery.  In  general,  the 
transformation  between  two  views  of  a  scene  can  be 
modeled  by  (i)  an  external  coordinate  transforma¬ 
tion  that  specifies  the  3D  alignment  parameters  be¬ 
tween  the  reference  and  the  camera  coordinate  sys¬ 
tems,  and  (ii)  an  internal  camera  coordinate  sys¬ 
tem  to  image  transformation  that  typically  involves 
a  linear  (affine)  transformation  and  non-linear  lens 
distortion  parameters.  Our  approach  to  the  precise 
alignment  problem  combines  the  external  coordinate 
transformation and the linear internal transformation
into a single 3D projective view transformation.
This, along with the depth image and the non-linear
distortion parameters, completely specifies the alignment
transformation between the video pixels and
those  in  the  reference  imagery.  It  is  to  be  emphasized 
that  one  main  advantage  of  our  approach  is  that  no 
explicit  camera  calibration  parameters  need  be  spec¬ 
ified.  This  aspect  tremendously  increases  the  scope 
of  applicability  of  our  proposed  system  to  fairly  arbi¬ 
trary  video  camera  platforms.  The  modeled  video- 
to-reference  transformation  is  applied  to  the  solu¬ 
tion  of  the  precise  alignment  problem.  The  process 
involves  simultaneous  estimation  of  the  unknown 
transformation  parameters  as  well  as  the  warped  ref¬ 
erence  imagery  that  precisely  aligns  with  the  video 
imagery.  Multi-resolution  coarse-to-fine  estimation 
and  warping  with  Gaussian/Laplacian  pyramids  is 
employed. 
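
The coarse-to-fine idea can be illustrated on a one-dimensional signal. The sketch below is a simplification under several assumptions that differ from the system described: it estimates only an integer translation, it builds the pyramid with a two-tap box filter rather than Gaussian/Laplacian filters, and it searches a small window of shifts at each level instead of solving for parameters by gradient-based optimization. All names are illustrative.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs of samples."""
    return [(signal[i] + signal[i + 1]) / 2.0
            for i in range(0, len(signal) - 1, 2)]


def mean_squared_difference(a, b, shift):
    """Mean of (a[i] - b[i + shift])^2 over the overlapping samples."""
    total, count = 0.0, 0
    for i in range(len(a)):
        j = i + shift
        if 0 <= j < len(b):
            total += (a[i] - b[j]) ** 2
            count += 1
    return total / count if count else float("inf")


def coarse_to_fine_shift(a, b, levels=3, radius=2):
    """Estimate s such that b[i + s] matches a[i], coarse to fine."""
    pyramid_a, pyramid_b = [a], [b]
    for _ in range(levels - 1):
        pyramid_a.append(downsample(pyramid_a[-1]))
        pyramid_b.append(downsample(pyramid_b[-1]))

    shift = 0
    for level in reversed(range(levels)):
        shift *= 2  # a shift at a coarse level doubles at the next finer one
        window = range(shift - radius, shift + radius + 1)
        shift = min(window, key=lambda s: mean_squared_difference(
            pyramid_a[level], pyramid_b[level], s))
    return shift


# A triangular bump and a copy displaced by 8 samples.
signal_a = [max(0.0, 10.0 - abs(i - 24)) for i in range(64)]
signal_b = [max(0.0, 10.0 - abs(i - 32)) for i in range(64)]
estimated_shift = coarse_to_fine_shift(signal_a, signal_b)
```

With a search radius of 2 at each of three levels, the procedure recovers a displacement of 8 samples while testing only 15 candidate shifts. A single-level search with the same radius could not reach it.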

Once  the  indexing  and  registration  steps  have  pre¬ 
cisely  located  a  video  mosaic  in  the  reference  image 


coordinates,  maintenance  of  the  alignment  need  not 
be  done  frequently  through  indexing.  The  align¬ 
ment  parameters  computed  between  video  frames 
may  be  combined  with  those  computed  between  a 
video  mosaic  and  the  reference  to  maintain  align¬ 
ment  between  the  video  and  reference  imagery. 

4.1  Formulation 

We now present the equations used for aligning
video imagery to a co-registered reference image
and depth image.^2 The formulation used is the
plane+parallax model developed by [6, 11, 13]. The
coordinates of a point in a video image are denoted
by $(x, y)$. The coordinates of the corresponding
point in the reference image are given by $(X_r, Y_r)$.
Each point in the reference image has a parallax
value $k$. The parallax value is computed from the
dense depth image which is co-registered with the
reference image.

There are fifteen parameters $a_1 \ldots a_{15}$ used to
specify the alignment.

The parameters $a_1 \ldots a_9$ specify the motion of a
virtual plane.

The parameters $a_{10} \ldots a_{12}$ specify the 3D parallax
motion.

The parameter $a_{13}$ specifies the lens distortion.

The parameters $a_{14} \ldots a_{15}$ specify the center for lens
distortion.


First the reference image coordinates $(X_r, Y_r)$ are
projected to ideal video coordinates $(X_I, Y_I)$:

$$X_I = \frac{a_1 X_r + a_2 Y_r + a_3 + k\, a_{10}}{a_7 X_r + a_8 Y_r + a_9 + k\, a_{12}}, \qquad Y_I = \frac{a_4 X_r + a_5 Y_r + a_6 + k\, a_{11}}{a_7 X_r + a_8 Y_r + a_9 + k\, a_{12}} \qquad (2)$$


Note that, since the right-hand side of each of the above two
equations is a ratio of two expressions, the parameters
$a_1 \ldots a_{12}$ can only be determined up to a scale
factor. We typically set parameter $a_9 = 1$ and
solve for the remaining 11 parameters.

^2 The equations for aligning video imagery to a co-registered
orthophoto and DEM are similar.
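
The scale ambiguity noted above is easy to verify numerically. The sketch below implements the projection from reference coordinates to ideal video coordinates as a ratio of two linear expressions in the reference coordinates and the parallax value, with the parameters stored in a dictionary keyed 1 to 12. The parameter values are made up for the example and the function names are illustrative.

```python
def project_to_ideal(a, Xr, Yr, k):
    """Project reference coordinates (Xr, Yr) with parallax k to ideal
    video coordinates using parameters a[1] .. a[12]."""
    denominator = a[7] * Xr + a[8] * Yr + a[9] + k * a[12]
    XI = (a[1] * Xr + a[2] * Yr + a[3] + k * a[10]) / denominator
    YI = (a[4] * Xr + a[5] * Yr + a[6] + k * a[11]) / denominator
    return XI, YI


def normalize(a):
    """Remove the free scale factor by forcing a[9] to 1."""
    return {i: value / a[9] for i, value in a.items()}


# Made-up parameter values, for illustration only.
params = {1: 1.0, 2: 0.02, 3: 5.0, 4: -0.01, 5: 1.0, 6: -3.0,
          7: 1e-4, 8: 2e-4, 9: 1.0, 10: 0.5, 11: -0.2, 12: 0.001}
scaled = {i: 2.5 * value for i, value in params.items()}

point = project_to_ideal(params, 120.0, 80.0, 0.3)
point_scaled = project_to_ideal(scaled, 120.0, 80.0, 0.3)
renormalized = normalize(scaled)
```

Multiplying all twelve parameters by the same factor leaves the projected point unchanged, so only eleven of them carry information. Fixing the ninth parameter to 1 is one convenient way to pick a representative.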


The ideal video coordinates $(X_I, Y_I)$ are related
to the measured video coordinates $(x, y)$ by the
following equations:

$$x = X_I + a_{13} (X_I - a_{14})\, r^2, \qquad y = Y_I + a_{13} (Y_I - a_{15})\, r^2 \qquad (3)$$

where

$$r^2 = (X_I - a_{14})^2 + (Y_I - a_{15})^2 \qquad (4)$$

Note that the lens distortion parameters $a_{13} \ldots a_{15}$ may
be computed at the video mosaicing stage. In that
case, the estimated values are used. However, we
have also implemented the above system such that the
lens distortion parameters are computed simultaneously
with the projective parameters $a_1 \ldots a_8$ and the
epipolar parameters $a_{10} \ldots a_{12}$. The parallax value^3 [6, 11, 13] $k$ at
any reference location is calculated from the depth
$z$ at that location using the following equation:

$$k = \frac{z - \bar{z}}{z\, \sigma_z} \qquad (5)$$

where $\bar{z}$ and $\sigma_z$ are the average and standard deviation
of the depth image values.

4.2  Pre-filtering 

The  reference  imagery  and  the  video  are  typically 
acquired  at  different  times.  Hence,  to  correlate  the 
video  to  the  reference  imagery,  we  do  the  follow¬ 
ing  transformations.  We  first  compute  and  match 
the  histograms  [4]  of  the  video  image  to  the  pre¬ 
dicted  piece  of  the  reference  image.  This  allows  us 
to modify the video image so that it has a histogram
similar to that of the reference image. Finally, we
compute the Laplacian pyramids of the reference image
and the modified video image. The alignment
parameters  are  computed  by  correlating  these  two 
images. 
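
A common way to carry out the histogram matching step is to map each intensity level through the two cumulative histograms. The sketch below shows that standard construction for integer intensities; it is offered as an illustration and is not necessarily the procedure of [4]. The function names and the tiny test images are assumptions.

```python
def cumulative_histogram(values, levels):
    """Normalized cumulative histogram of integer intensities."""
    histogram = [0] * levels
    for v in values:
        histogram[v] += 1
    total = float(len(values))
    cumulative, running = [], 0
    for count in histogram:
        running += count
        cumulative.append(running / total)
    return cumulative


def match_histogram(video, reference, levels=256):
    """Remap video intensities so that their cumulative histogram
    follows that of the reference image."""
    cdf_video = cumulative_histogram(video, levels)
    cdf_reference = cumulative_histogram(reference, levels)
    lookup, j = [], 0
    for i in range(levels):
        # Smallest reference level whose cumulative count reaches the
        # cumulative count of video level i. Both CDFs are
        # non-decreasing, so j never needs to move backwards.
        while j < levels - 1 and cdf_reference[j] < cdf_video[i]:
            j += 1
        lookup.append(j)
    return [lookup[v] for v in video]


# Flattened toy images with 8 grey levels: the video is darker.
video_pixels = [0, 0, 1, 1, 2, 2, 3, 3]
reference_pixels = [4, 4, 5, 5, 6, 6, 7, 7]
matched = match_histogram(video_pixels, reference_pixels, levels=8)
```

The remapping is monotonic, so it changes the overall brightness and contrast of the video image without reordering intensities. The spatial structure used by the subsequent correlation is preserved.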

4.3  Optimization 

To register the video image to the reference image,
we use a hierarchical direct registration technique [1].
This technique first constructs filter pyramids
from each of the two input images, and then estimates
the motion parameters in a coarse-to-fine manner.
Within each level the sum of squared differences
(SSD)  measure  integrated  over  user  selected  regions 
of  interest  is  used  as  a  match  measure.  This  measure 
is  minimized  with  respect  to  the  unknown  transfor¬ 
mation  parameters  ai..ai5.  The  SSD  error  measure 
for  estimating  the  transformation  parameters  within 
a  region  is: 

£;({A})  =^(/(x,t)-/(Ax),t-l))^  (6) 


®In  the  case  of  the  reference  image  being  an  orthophoto 
with  a  corresponding  DEM,  k  is  equal  to  the  DEM  value 


Figure  7:  One  frame  from  a  video  clip  captured  at 
a  highly  oblique  angle  with  respect  to  the  reference 
imagery.  The  reference  image  is  the  same  as  in  the 
nadir  view  case. 

Finally,  for  annotation  and  other  visualization 
tasks,  it  is  important  for  the  user  to  be  able  to  map 
points  from  the  video  to  the  reference  image  and 
vice  versa.  To  map  points  from  the  reference  image 
to  the  video,  we  use  equations  (3)  and  (4))  and  com¬ 
pute  the  values  on  the  right  hand  side.  However,  to 
map  a  video  point  to  the  reference  image,  we  solve 
the  equations  (3)  and  (4))  using  Newton’s  method. 
We  use  Newton’s  method  in  two  steps,  we  first  solve 
equation  (4)  and  then  use  the  results  of  that  to  solve 
equation  (3). 

Similarly  for  warping  the  video  image  to  the  refer¬ 
ence  image,  we  can  use  reverse  warping  with  bilinear 
interpolation.  However,  to  warp  the  reference  image 
to  appear  in  the  video  image  coordinates,  we  must 
use  forward  warping.  Point  mappings  in  the  forward 
warping  process  are  computed  using  the  above  tech¬ 
nique. 


where  x  =  {x,  y)  denotes  the  spatial  image  position 
of  a  point,  I  the  (Laplacian  pyramid)  image  inten¬ 
sity  and  (Ax  denotes  the  image  transformation  at 
that  point  (see  equations  (3)  and  (4)).  The  error  is 
computed  over  all  the  points  within  the  region.  The 
optimization  is  done  in  an  iterative  manner,  at  each 
level  of  the  pyramid  using  the  Levenbreg  Marquardt 
optimization  technique. 

4.4  Geo-mosaicing,  Mapping  points, 
Warping  Images 

Once the alignment parameters have been computed,
the video images can be warped to the reference
image. These video images can then be merged
to construct geo-mosaics (geo-referenced video
mosaics). These mosaics can be used to update the
reference imagery. We show examples of the
geo-referenced video mosaics constructed using this
technique for both nadir and oblique imagery in Figure
8. The original reference image and depth image can
be  seen  in  Figure  6.  The  oblique  video  image  can  be 
seen  in  Figure  7.  The  nadir  video  images  can  be  seen 
Figure 6: Left: Reference Image, Right: Digital Elevation Map.


Figure  8:  Left:  Geo-registered  nadir  video,  and  Right:  Geo-registered  oblique  video  shown  overlaid  on  the 
reference  image. 


Figure  9:  Selected  points  overlaid  on  one  frame  of 
the  aerial  video. 

In order to assess the approximate accuracy of
geo-referencing, a few points were manually selected
in a video frame and the corresponding points
manually identified in the reference image. Figure 9 shows
the selected points marked with +’s overlaid on the
video frame. Points in the reference image
corresponding to those in the video were also identified
using the geo-registration algorithms. Table 1 shows
the accuracy of located points in the reference with
respect to the hand-selected ones. The second and third
columns in the table show the coordinates of the
selected video points and the subsequent two columns
show the corresponding points selected manually in
the reference image. The last two columns show the
points computed in the reference image by the
geo-registration algorithm. Most correspondences are
within 1 pixel of the manually selected locations.
Abstract

This  paper  describes  novel  algorithms  that  use 
absolute  camera  pose  information  to  identify 
correspondence  among  point  features  in  hun¬ 
dreds  or  thousands  of  images.  Our  incidence 
counting  algorithm  is  a  geometric  approach  to 
matching;  it  matches  features  by  extruding 
them  into  an  absolute  3-D  coordinate  system, 
then  searching  3-D  space  for  regions  into  which 
many  features  project. 

The  absolute  pose  estimates  reported  by  our 
instrumentation  are  accurate,  but  not  perfect. 
Thus,  we  also  consider  the  problem  of  refin¬ 
ing  these  pose  estimates,  given  feature  matches 
from  a  set  of  images.  We  describe  a  pose 
refinement  algorithm  which  decouples  transla¬ 
tion  (position)  estimates  from  rotation  (atti¬ 
tude)  estimates,  and  can  incorporate  matches 
from  many  hundreds  or  thousands  of  images. 

1  Introduction 

Many  3-D  reconstruction  algorithms  rely  on 
a  matching  or  correspondence  step  to  identify 
constraints  corresponding  to  the  scene  geome¬ 
try;  these  constraints  are  used  to  guide  the  3- 
D  reconstruction  process.  Typically,  matching 
is  performed  using  some  image  attribute  (e.g., 
pixel  luminance  [Gennery,  1977])  or  some  geo¬ 
metric  attribute  (e.g.,  length  and  orientation  of 
edges [Ayache, 1991]). While these techniques
work  well  for  images  taken  from  nearby  camera 
positions,  they  are  less  effective  for  disparate 
images  taken  from  cameras  that  are  far  from 
each  other. 

In  this  paper,  we  design  a  matching  algorithm 
that  uses  camera  pose  estimates  (provided  by 
physical  instrumentation)  to  over-constrain  the 
matching  problem,  identifying  matches  by  ap¬ 
plying  geometric  constraints  imposed  by  the 
camera  positions.  In  some  ways,  our  algorithm 
is  similar  to  use  of  the  epipolar  constraint  in 
stereo  vision  [Faugeras,  1993],  but  generalizes 
that  method  in  its  incorporation  of  many  cam¬ 
eras  and  images. 

As  the  absolute  pose  estimates  reported  by  our 
instrumentation  are  not  perfect,  we  also  con¬ 
sider  the  problem  of  camera  pose  refinement, 
i.e.,  computing  accurate  camera  poses  for  many 
images,  given  matches  between  points  and  fairly 
accurate  initial  pose  estimates. 

Much  of  the  existing  research  on  pose  refine¬ 
ment  has  revolved  around  the  assumption  that 
no  3-D  information  is  available  [Mohr  et  al., 
1995,  Faugeras,  1992].  The  basis  of  these  al¬ 
gorithms  is  the  epipolar  constraint  between  two 
images: 

$$m^{T} F\, m' = 0$$

where  F  is  the  3x3  fundamental  matrix  re¬ 
lating  two  (projective)  points  m  and  m'  in  the 
two  images.  Determining  the  fundamental  ma¬ 
trix  is  equivalent  to  determining  the  (relative) 
poses  of  the  two  cameras  involved.  Given  eight 
or more point correspondences, it is possible to
determine the fundamental matrix up to a scale
factor using the eight point algorithm
[Longuet-Higgins, 1981]. However, typical algorithms
[Faugeras, 1992, Hartley, 1995] use more than eight
points in order to increase the robustness of
the algorithm.

While  this  technique  performs  well  for  pairs  of 
images,  there  are  several  disadvantages  in  using 
the  fundamental  matrix  technique  for  a  large 
number  of  images.  First,  these  algorithms  in¬ 
volve  only  pairwise  matching;  using  them  to 
compute  pose  for  m  cameras  pairwise  may  re¬ 
sult  in  large  “drift”  error.  Second,  they  de¬ 
termine  camera  pose  and  3-D  positions  only 
up  to  a  projective  transformation,  which  needs 
to  be  “corrected”  as  a  post-processing  step. 
Third, the use of projective matrices increases
the complexity of the solution because of the greater
number of variables and more complicated
constraints (such as singularity).

In  contrast,  we  formulate  the  problem  as  a  di¬ 
rect  3-D  optimization  algorithm  that  refines 
initial camera pose estimates. One advantage
of this approach is that there are fewer unknown
variables, which increases the robustness
of the algorithm. Also, the algorithm can
seamlessly  incorporate  matches  across  many  im¬ 
ages.  Finally,  from  a  practical  standpoint,  it 
is  much  easier  to  visualize  and  debug  (using 
computer  graphics)  algorithms  operating  in  3- 
D;  this  would  be  much  harder  for  algorithms 
that  operate  in  more  complex  spaces. 

2  Incidence  Counting 

The  incidence  counting  algorithm  is  based  on 
the  following  property  of  projection:  if  any 
sparse  set  of  features  in  multiple  images  are  ex¬ 
truded  to  3-D,  then  it  is  likely  that  regions  of 
high  incidence  (regions  where  extrusions  from 
multiple  cameras  intersect)  correspond  to  real 
3-D features. Figure 1 illustrates the idea of the
algorithm in 2-D. In the figure, E1, E2, and E3
are cameras imaging three features (points A,
B,  C).  The  extrusions  of  the  image  features  are 
rays  originating  from  the  camera  and  passing 
through  the  feature.  If  a  feature  is  present  in 


Figure  1:  Incidence  counting  in  two  dimen¬ 
sions. 

k images, k rays would intersect at the location
corresponding to the feature (e.g., points A, B,
C all have a high incidence of k = 3). Thus, a simple
way to identify matches would be to search for
regions with high incidence; an efficient method
to perform the search is described in Section 2.1.

Note  that,  in  addition  to  the  “true”  features, 
there  are  also  spurious  regions  with  high  inci¬ 
dence.  For  example,  even  though  point  D  was 
not  one  of  the  features  imaged  by  the  cameras, 
D  has  the  property  that  rays  from  all  three  cam¬ 
eras  pass  close  to  it;  i.e.,  D  is  a  possible  candi¬ 
date  for  a  match.  Section  2.2  provides  a  method 
to  eliminate  some  spurious  matches  by  associat¬ 
ing  an  error  value  with  each  3-D  position.  Fu¬ 
ture  work  will  incorporate  methods  using  image 
attributes  (color  and  texture)  to  eliminate  ad¬ 
ditional  spurious  matches. 

2.1  Octree-Based  Incidence 
Counting 

Our  algorithm  for  incidence  counting  requires 
two  parameters  in  addition  to  the  images  and 
camera  poses: 

•  ε, a “nearness” threshold. This is necessary
to handle (small) errors in either the
location of the image feature or in camera
pose. The choice of ε depends on both the
accuracy of the camera pose estimate, and
the desired accuracy of the reconstruction.

•  k,  the  incidence  threshold.  It  is  related  to 
the  density  of  camera  positions  relative  to 
the  features  of  interest  -  a  reasonable  value 
would  be  the  average  number  of  cameras 
imaging  a  feature. 

Given these parameters, points of high incidence
are those for which k or more rays pass
by within a distance ε.
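
The high-incidence test can be sketched directly from this definition. The sketch below is an illustration under our own assumptions: rays are given as hypothetical `(origin, direction)` pairs, and the names `point_ray_distance`, `incidence`, and `is_high_incidence` are not from the paper.

```python
import math

def point_ray_distance(point, origin, direction):
    """Distance from a 3-D point to the ray origin + s * direction, s >= 0."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = [c / norm for c in direction]
    w = [p - o for p, o in zip(point, origin)]
    # Clamp the parameter at 0 because a ray starts at the camera.
    s = max(0.0, sum(a * b for a, b in zip(w, d)))
    closest = [o + s * c for o, c in zip(origin, d)]
    return math.dist(point, closest)

def incidence(point, rays, eps):
    """Number of rays passing within eps of the point."""
    return sum(1 for origin, direction in rays
               if point_ray_distance(point, origin, direction) <= eps)

def is_high_incidence(point, rays, eps, k):
    return incidence(point, rays, eps) >= k

# Three cameras observing the same feature at (1, 1, 0).
feature = (1.0, 1.0, 0.0)
cameras = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
rays = [(cam, tuple(f - c for f, c in zip(feature, cam))) for cam in cameras]
```

Evaluating this test at every candidate point is exactly what becomes too expensive for large ray sets, which motivates the spatial data structure described next.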

Possible methods of identifying high incidence
are to check the above condition for (1) all
k-element subsets of the set of rays, or (2) all 3-D
points in a discrete set (e.g., regular grid). Both
these methods have disadvantages. Checking
all possible subsets suffers from a combinatorial
increase in complexity with k; checking only
a discrete set of 3-D points suffers from the
usual problems of point sampling, i.e., missing
some 3-D feature (undersampling) or
inefficiency (oversampling).

Fortunately, rays constructed by extrusion
exhibit the clustering property; while there are
regions of high density (e.g., near features), there
are large regions containing very few rays. We
exploit this property by constructing an octree
[Samet, 1990] to store the rays. The octree is
constructed by associating the region of interest
(a bounding-box overestimate of the cameras
and the scene to be modeled) with the root
node, and subdividing octree nodes until either
each leaf node is associated with fewer than k
rays¹, or its dimensions are less than ε. Once
the octree has been constructed, each leaf node
is examined to check whether the rays through
it pass within ε of each other. This can
be performed by computing the (least-squares)
best point lying on all these rays. The algorithm
reports all the points (and corresponding rays)
whose error is less than ε.

2.2  Eliminating  Spurious  Matches 

As mentioned earlier, one drawback of the
incidence counting algorithm is that it also identifies
spurious matches. In this section, we design
an algorithm to eliminate some spurious
matches by enforcing the constraint that a single
ray can contribute to at most one 3-D point.
The algorithm given below uses the error metric
associated with a 3-D point to choose at most
one 3-D point for each ray. Informally, it uses
the criterion that 3-D points with low error are
retained, and those with high error are rejected.

¹We associate a ray with an octree node if it intersects
the ε-extended box around the node.

Algorithm  Check-Spurious: 

1.  Sort all (say, n) high-incidence 3-D points
in order of increasing error (the lowest error
being first). $P_i$ denotes the i-th 3-D point.

2.  foreach 1 ≤ i ≤ n do

(a)  if ($P_i$ is invalid) continue;

(b)  Output $P_i$ as a valid point.

(c)  foreach i < j ≤ n such that $P_i$ and
$P_j$ share a ray, mark $P_j$ as invalid.

This  algorithm  also  has  the  property  that  it 
computes  the  minimum  error  valid  configura¬ 
tion  (in  a  lexicographic  sense). 
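
Check-Spurious can be sketched in a few lines. The representation below is our own assumption: each candidate is a hypothetical `(error, ray_ids, point)` tuple. Instead of marking later points invalid one pair at a time, the sketch keeps the set of rays already claimed by accepted points, which selects the same points.

```python
def check_spurious(candidates):
    """Greedy selection: lowest error first, each ray used at most once.

    candidates: iterable of (error, ray_ids, point), with ray_ids a set.
    Returns the retained candidates in order of increasing error.
    """
    used_rays = set()
    valid = []
    for candidate in sorted(candidates, key=lambda c: c[0]):
        error, ray_ids, point = candidate
        # A candidate is invalid iff it shares a ray with an accepted one.
        if used_rays.isdisjoint(ray_ids):
            valid.append(candidate)
            used_rays.update(ray_ids)
    return valid

# Points A, B, C are true features; D reuses one ray from each of them.
candidates = [
    (0.5, {1, 2, 3}, "D"),
    (0.1, {1, 4, 7}, "A"),
    (0.2, {2, 5, 8}, "B"),
    (0.3, {3, 6, 9}, "C"),
]
kept = [point for _, _, point in check_spurious(candidates)]
```

In this example the spurious point D is rejected because all three of its rays have already been assigned to lower-error points.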


3  A  Direct  3-D  algorithm  for 
Camera  Pose  Refinement 


Figure  2:  Reconstruction  using  least  squares 
of  distances. 

We  now  present  an  algorithm  for  refining  cam¬ 
era  pose  estimates,  given  matches  across  points 
in  different  images.  Figure  2  illustrates  (in  2-D) 
the  idea  behind  our  algorithm.  If  the  camera 
poses  are  accurate,  then  the  rays  constructed  by 
extrusion  would  pass  through  the  reconstructed 
3-D point. Typically, due to error in the camera
pose estimates, they will diverge from the
reconstructed point. This can be used to “correct”
the camera poses so that the rays are as
close as possible to the 3-D reconstruction.

Formally,  the  pose  refinement  problem  is  as  fol¬ 
lows.  Given: 

•  For 1 ≤ i ≤ m, $E'_i$ and $R'_i$, the translation
and rotation estimates of camera i;

•  For 1 ≤ i ≤ m, 1 ≤ j ≤ n, rays $v_{ij}$,
unit vectors that correspond to projections
of point $P_j$ from camera i (in the camera’s
coordinate system);

compute $E_i$, $R_i$ for 1 ≤ i ≤ m (the true pose of
each camera), and $P_j$ for 1 ≤ j ≤ n (the correct
3-D position of each matched point).

We formulate the problem as a minimization of
the following objective function²:

$$O = \sum_{i=1}^{m} \sum_{j=1}^{n} \bigl\| (P_j - E_i) \times R_i(v_{ij}) \bigr\|^2$$

Geometrically,  this  function  represents  the  sums 
of  the  squared  distances  from  reconstructed 
points  to  their  corresponding  rays  (Figure  2). 

As  the  objective  function  does  not  have  a  linear 
least-squares  form,  we  use  an  iterative  method 
to  solve  for  camera  pose.  Our  approach  is  to 
consider  the  problems  of  finding  each  trans¬ 
formation  independently  (assuming  the  other 
is  known  accurately)  and  combining  the  two 
methods  when  neither  translations  nor  rotations 
are known exactly. While this is equivalent to
minimizing the objective function using partial
derivatives with respect to translations and
rotations, it is helpful to separate the two cases
for clearer presentation; the solutions to these two
cases turn out to be quite different.


²Note that $P_j = 0$ and $E_i = 0$ is a trivial solution
to the minimization problem. This can be avoided by
imposing a constraint that the sum of their magnitudes
must be some non-zero constant. In practice, due to
the use of initial pose estimates, we have found that the
optimization converges to non-trivial solutions.


3.1  Translations 

In this section, we solve for translations of
the cameras, assuming that their rotations are
known accurately. Thus, $R_i(v_{ij})$ can be
replaced by a (known) unit vector $v'_{ij}$. The
resulting objective function has the following
form:

$$O = \sum_{i=1}^{m} \sum_{j=1}^{n} \bigl\| (P_j - E_i) \times v'_{ij} \bigr\|^2$$

which can be written as:

$$O = \sum_{i=1}^{m} \sum_{j=1}^{n} \bigl\| L'_{ij} (P_j - E_i) \bigr\|^2$$

where $L'_{ij}$ is the 3x3 skew-symmetric matrix
defining the cross product whose elements are
determined by the components of $v'_{ij}$:

$$L'_{ij} = \begin{pmatrix} 0 & v'_{ij,3} & -v'_{ij,2} \\ -v'_{ij,3} & 0 & v'_{ij,1} \\ v'_{ij,2} & -v'_{ij,1} & 0 \end{pmatrix}$$

Writing $\|x\|^2 = x \cdot x$ as $x^T x$, we obtain:

$$O = \sum_{i=1}^{m} \sum_{j=1}^{n} (P_j - E_i)^T L'^{T}_{ij} L'_{ij} (P_j - E_i)$$

This is of the form $x^T A x$, where A is a symmetric
matrix. The derivative of this function
with respect to x is $2Ax$.

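Computing the derivative of this function with
respect to $P_j$, and setting it to 0 yields:

$$\sum_{i=1}^{m} L'^{T}_{ij} L'_{ij} (P_j - E_i) = 0$$

Thus, $P_j = A^{-1} b$, where

$$A = \sum_{i=1}^{m} L'^{T}_{ij} L'_{ij}, \qquad b = \sum_{i=1}^{m} L'^{T}_{ij} L'_{ij} E_i$$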

Geometrically,  this  solution  gives  the  point  that 
minimizes  the  sum  of  squared  distances  of  Pj 
from  the  corresponding  rays. 
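
The closed-form point estimate can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: `skew`, `gram`, `solve3`, and `closest_point_to_rays` are names we introduce, and directions are normalized inside the function so that callers need not supply unit vectors.

```python
import math

def skew(v):
    """Matrix L such that L applied to x equals the cross product x × v."""
    v1, v2, v3 = v
    return [[0.0, v3, -v2],
            [-v3, 0.0, v1],
            [v2, -v1, 0.0]]

def gram(L):
    """Return L^T L for a 3x3 matrix L."""
    return [[sum(L[k][r] * L[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def solve3(A, b):
    """Solve the 3x3 system A x = b by Gauss-Jordan elimination with pivoting."""
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def closest_point_to_rays(centers, directions):
    """Point minimizing the sum of squared distances to the given lines."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for E, v in zip(centers, directions):
        norm = math.sqrt(sum(c * c for c in v))
        LtL = gram(skew([c / norm for c in v]))
        for r in range(3):
            for c in range(3):
                A[r][c] += LtL[r][c]          # A = sum of L^T L
                b[r] += LtL[r][c] * E[c]      # b = sum of L^T L E
    return solve3(A, b)

# Three cameras whose rays all pass through the same 3-D point.
target = (1.0, 2.0, 3.0)
centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 1.0)]
directions = [tuple(t - e for t, e in zip(target, E)) for E in centers]
estimate = closest_point_to_rays(centers, directions)
```

The matrix A is invertible as long as the rays are not all parallel, so at least two cameras with distinct viewing directions are needed per point.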
Figure 3: Translation estimate using inverse
rays. 

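As the objective function is symmetrical in $P_j$
and $E_i$, setting the derivative with respect to
$E_i$ to 0 yields the equation $E_i = A^{-1} c$, where A
is now summed over the points seen by camera i:

$$A = \sum_{j=1}^{n} L'^{T}_{ij} L'_{ij}, \qquad c = \sum_{j=1}^{n} L'^{T}_{ij} L'_{ij} P_j$$

This is equivalent to finding the 3-D point that
minimizes the sum of squared distances from
the “inverse” rays through $P_1 \ldots P_n$ (Figure 3).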

The translation refinement algorithm alternately
computes 3-D positions and camera
translation estimates using the equations given
above³. Convergence is detected when the
objective function changes little between iterations.

3.2  Rotations 

The  first  step  in  an  optimization  involving  un¬ 
known  rotations  is  to  choose  a  representation  for 
expressing  rotations.  A  variety  of  representa¬ 
tions  are  in  use:  orthonormal  matrices,  quater¬ 
nions,  Euler  angles,  etc.  [Foley  et  al.,  1990]. 
Each  of  these  representations  has  its  own  ad¬ 
vantages  and  disadvantages;  the  most  appro¬ 
priate  representation  depends  on  the  applica¬ 
tion  (e.g.,  quaternions  provide  closed  form  so¬ 
lutions  for  absolute  orientation  [Horn,  1987]). 

³The solution is valid only up to a similarity (rotation,
translation, uniform scaling) transformation. The “correct”
transformation can be obtained by fixing the values
of some three points in absolute coordinates.


For  this  optimization,  we  chose  to  use  Euler  an¬ 
gles,  i.e.,  rotation  is  represented  by  three  ro¬ 
tations  about  the  coordinate  axes.  This  has 
the  advantage  that  no  additional  constraints 
are  needed  to  ensure  rotational  properties,  in 
contrast  to  the  orthonormality  constraint  for 
3x3  matrices  or  the  unit  length  constraint  for 
quaternions.  This  allows  use  of  simple  (uncon¬ 
strained)  non-linear  optimization  methods  such 
as  the  Newton-Raphson  method  [Scales,  1985] 
to  solve  for  the  rotation  parameters. 

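Rotations are represented as:

$$R_i = R^x(r_i)\, R^y(s_i)\, R^z(t_i)$$

where $r_i$, $s_i$, $t_i$ are the Euler angles, and
$R^x$, $R^y$, $R^z$ are 3x3 matrices representing
rotations about the coordinate axes. For example,
$R^z(\theta)$ is the matrix:

$$R^z(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}$$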
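
The Euler-angle representation can be sketched as below. The composition order $R^x(r) R^y(s) R^z(t)$ and the sign conventions for the x and y rotations are assumptions on our part (only the z matrix is given explicitly above), and the function names are ours.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def euler_to_matrix(r, s, t):
    """Compose the rotation as Rx(r) Ry(s) Rz(t) (assumed order)."""
    return matmul(rot_x(r), matmul(rot_y(s), rot_z(t)))

R = euler_to_matrix(0.3, -0.2, 0.5)
Rt = [[R[j][i] for j in range(3)] for i in range(3)]
should_be_identity = matmul(R, Rt)
quarter_turn = matvec(euler_to_matrix(0.0, 0.0, math.pi / 2), [1.0, 0.0, 0.0])
```

Any triple of angles yields a valid rotation, so the optimizer can update r, s, t freely without re-imposing orthonormality after each step.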

We  use  the  iterative  Newton-Raphson  method 
using  the  gradient  (a  vector  formed  by  the 
first  partial  derivatives)  and  Hessian  (a  matrix 
formed  by  the  second  partial  derivatives)  of  the 
objective  function  to  solve  for  the  camera  rota¬ 
tions  [Scales,  1985].  Given  initial  estimates  of 
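r, s, t for some camera i (subscripts are omitted
for clarity), increments Δr, Δs, Δt are defined
by the gradient and the Hessian:

$$\begin{pmatrix}
\dfrac{\partial^2 O}{\partial r^2} & \dfrac{\partial^2 O}{\partial r\,\partial s} & \dfrac{\partial^2 O}{\partial r\,\partial t} \\
\dfrac{\partial^2 O}{\partial r\,\partial s} & \dfrac{\partial^2 O}{\partial s^2} & \dfrac{\partial^2 O}{\partial s\,\partial t} \\
\dfrac{\partial^2 O}{\partial r\,\partial t} & \dfrac{\partial^2 O}{\partial s\,\partial t} & \dfrac{\partial^2 O}{\partial t^2}
\end{pmatrix}
\begin{pmatrix} \Delta r \\ \Delta s \\ \Delta t \end{pmatrix}
=
\begin{pmatrix} -\dfrac{\partial O}{\partial r} \\ -\dfrac{\partial O}{\partial s} \\ -\dfrac{\partial O}{\partial t} \end{pmatrix}$$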

The partial derivatives are obtained by symbolically
differentiating the objective function with
respect to r, s, t and evaluating the expressions
using the current values of r, s, t. Some of the
partial derivative expressions are listed in the
appendix.

Given the current rotation in terms of r, s, t,
the rotation refinement algorithm evaluates the
partial derivative expressions and computes
Δr, Δs, Δt. The new rotations are used to
update the 3-D positions of the reconstructed
points, and this process is repeated until
convergence.
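
One Newton-Raphson update of three parameters can be sketched as follows. Note the substitution: the paper differentiates the objective symbolically, whereas this sketch approximates the gradient and Hessian by central finite differences so that it stays short and generic. The names `gradient`, `hessian`, `newton_step`, and the test function `bowl` are ours.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def gradient(f, x, h):
    """Central-difference approximation of the first partial derivatives."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def hessian(f, x, h):
    """Central-difference approximation of the second partial derivatives."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                y = list(x)
                y[i] += di * h
                y[j] += dj * h
                return f(y)
            H[i][j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
    return H

def newton_step(f, x, h=1e-3):
    """Solve H * delta = -g and return the updated parameters."""
    delta = solve3(hessian(f, x, h), [-gi for gi in gradient(f, x, h)])
    return [xi + di for xi, di in zip(x, delta)]

# Hypothetical stand-in objective: a quadratic bowl in (r, s, t).
def bowl(p):
    r, s, t = p
    return (r - 1) ** 2 + 2 * (s + 2) ** 2 + 3 * (t - 0.5) ** 2 + r * s

updated = newton_step(bowl, [0.0, 0.0, 0.0])
```

For a quadratic objective a single step lands on the minimum; for the rotation objective the step is repeated, re-estimating the 3-D points between iterations, until the increments become negligible.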
4  Conclusion

We  presented  the  incidence  counting  algorithm 
that  identifies  matches  using  only  the  geometric 
constraints  implied  by  camera  pose.  The  algo¬ 
rithm  performs  fairly  well  for  synthetic  images 
and  camera  pose  [Coorg  and  Teller,  1996],  but 
more  experiments  on  real  data  are  needed  to 
fully  evaluate  its  efficacy. 

We  also  presented  a  direct  3-D  algorithm  to 
refine  camera  pose  estimates  given  correspon¬ 
dences.  Our  algorithm  operates  directly  in  3-D 
and can easily incorporate matches across hundreds
or thousands of images. Results of this
algorithm on synthetic data (random 3-D points,
perturbed camera poses) are presented in [Coorg
and Teller, 1996]; we plan to experiment with
real data when our pose-instrumented platform
is operational.
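A  Rotational  Partial  Derivatives

We only list the partial derivatives with respect
to t; the expressions for r and s are similar. Let

$$S^z(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

(with $S^x$ defined analogously), and

$$\begin{aligned}
D_{ij} &= P_j - E_i \\
V_{ij} &= D_{ij} \times \bigl(R^x(r)\, R^y(s)\, R^z(t)\, v_{ij}\bigr) \\
V^{t}_{ij} &= D_{ij} \times \bigl(R^x(r)\, R^y(s)\, S^z(t + \tfrac{\pi}{2})\, v_{ij}\bigr) \\
V^{rt}_{ij} &= D_{ij} \times \bigl(S^x(r + \tfrac{\pi}{2})\, R^y(s)\, S^z(t + \tfrac{\pi}{2})\, v_{ij}\bigr) \\
V^{tt}_{ij} &= D_{ij} \times \bigl(R^x(r)\, R^y(s)\, S^z(t + \pi)\, v_{ij}\bigr)
\end{aligned}$$

Then,

$$\frac{\partial O}{\partial t} = 2 \sum_{j=1}^{n} V_{ij} \cdot V^{t}_{ij}$$

$$\frac{\partial^2 O}{\partial t^2} = 2 \sum_{j=1}^{n} V^{t}_{ij} \cdot V^{t}_{ij} + 2 \sum_{j=1}^{n} V_{ij} \cdot V^{tt}_{ij}$$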
Abstract

The  core  of  multiple-view  geometry  is  governed  by 
the  fundamental  matrix  and  the  trilinear  tensor.  In 
this  paper  we  unify  both  representations  by  first  de¬ 
riving  the  fundamental  matrix  as  a  rank-2  trivalent 
tensor,  and  secondly  by  deriving  a  unified  set  of  oper¬ 
ators  that  are  transparent  to  the  number  of  views.  As 
a  result,  we  show  that  the  basic  building  block  of  the 
geometry  of  multiple  views  is  a  trivalent  tensor  that 
specializes  to  the  fundamental  matrix  in  the  case  of 
two  views,  and  is  the  trilinear  tensor  (rank-4  trivalent 
tensor)  in  case  of  three  views.  The  properties  of  the 
tensor  (geometric  interpretation,  contraction  proper¬ 
ties,  etc.)  are  independent  of  the  number  of  views 
(two or three). As a byproduct, every two-view
algorithm can be considered as a degenerate three-view
algorithm and three-view algorithms can work with
either two or three images, all using one standard set
of tensor operations. To highlight the usefulness of
this paradigm we developed two applications. The first
is a novel view synthesis algorithm that starts with the
rank-2 tensor and seamlessly moves to the general rank-4
trilinear tensor, all using one set of tensor operations.
The second is a camera stabilization algorithm.

1  Introduction 

The  geometry  of  multiple  views  is  governed  by  cer¬ 
tain  multi-linear  constraints,  bilinear  for  pairs  of  views 
and  trilinear  for  triplets  of  views  —  all  other  multi¬ 
linear  constraints  (four  views  and  beyond)  are  spanned 
by  the  bilinear  and  trilinear  constraints.  The  bilinear 
constraint determines the “fundamental matrix” and
the trilinear constraints determine the “trilinear
tensor”. The fundamental matrix is a rank-2 3x3 matrix
and the trilinear tensor is a rank-4 trivalent tensor.

*This work was partly funded by ARPA through the U.S.
Office of Naval Research under grant N00014-93-1-1202, R&T
Project Code 4424341-01, US-Israel binational science
foundation 94-00120, and the European ACTS project AC074
“VANGUARD”


There  are  known  properties  of  the  fundamental  ma¬ 
trix,  there  are  known  properties  of  the  trilinear  tensor, 
and  there  are  known  connections  between  the  two  — 
for  instance  how  to  extract  the  fundamental  matrix 
from  the  trilinear  tensor.  There  are  algorithms  (for 
reconstruction,  view  synthesis,  camera  stabilization) 
that  are  defined  for  concatenation  of  pairs  of  views, 
and  there  are  algorithms  that  are  defined  for  concate¬ 
nation  of  triplets  of  views.  What  is  needed,  therefore, 
is  a  canonical  representation,  a  single  object  with  a 
standard  set  of  operators,  that  applies  uniformly  to 
pairs  or  triplets  of  views.  In  other  words,  the  unifi¬ 
cation  efforts  that  have  appeared  so  far  in  the  liter¬ 
ature  focus  on  the  transformation  groups  (projective, 
affine  and  Euclidean)  represented  by  the  camera  ma¬ 
trix,  leading  to  a  canonical  framework  [3,  10,  6]  for 
the  geometry  of  two  views.  Given  the  recent  progress 
on  multi-linear  tensorial  constraints  across  more  than 
two  views,  there  is  a  need  to  make  a  similar  unification 
attempt  but  now  across  the  temporal  axis  (number  of 
views),  rather  than  on  the  spatial  axis  (transformation 
groups). 

The  paper  has  two  main  results.  First,  we  establish 
a  set  of  operators  that  are  used  to  synthesize  tensors 
from  one  another.  Second,  we  derive  the  geometry 
of  two  views  using  those  operators  and  show  that  the 
familiar  fundamental  matrix  is  embedded  in  a  rank- 
2  trivalent  tensor  (of  27  coefficients).  We  show  that 
the  properties  of  the  rank-2  tensor  are  identical  with 
the  known  properties  of  the  rank-4  trilinear  tensor  (of 
three  distinct  views),  and  the  set  of  operators  apply 
uniformly  to  both  tensors.  As  a  result,  the  geome¬ 
try  of  multiple  views  is  governed  by  a  single  tensorial 
structure  with  a  standard  set  of  operators  and  is  uni¬ 
form  with  respect  to  the  number  of  views  —  the  only 
change  that  occurs  when  the  number  of  views  is  two  is 
that  the  rank  of  the  tensor  becomes  2  instead  of  4,  but 
this  does  not  have  an  effect  on  the  manner  in  which 
the  tensor  is  used  for  applications. 

Apart from the theoretical result, we show practical
benefits of this unification step. First is the
“cross-platform” capability of algorithms to work both in the
case of two and three views, as the latter is simply a
generalization of the former. This results in the ability
to handle freely and seamlessly the geometry of two
and three images in a single framework. Instead of
existing two-view algorithms one can use three-view
based algorithms, taking advantage of the third view,
in case it is present, but working with two images as
well without modification, all due to the introduction
of the rank-2 tensor. To demonstrate these properties
we  present  two  applications  —  a  novel  view  synthesis 
algorithm  that  highlights  the  simple  handling  of  the 
geometry  of  two  or  three  images  and  a  video  stabi¬ 
lization  algorithm  that  works,  as  is,  with  two  or  three 
images. 

2  Background  and  Notations 

We assume that the physical 3D world is represented
by the 3D projective space $\mathcal{P}^3$ (object space)
and that its projection onto the 2D projective space
$\mathcal{P}^2$ defines the image space. If $x \in \mathcal{P}^3$ varies over the
object space, represented by a tetrad of homogeneous
coordinates, and $p \in \mathcal{P}^2$ is its projection (represented
by a triplet of coordinates), then there exists a 3 x 4
matrix A satisfying the relation $p \cong Ax$, where $\cong$
represents equality up to scale and A is called the camera
matrix. Since only relative camera positioning can be
recovered from image measurements, the first camera
matrix can be represented by $[I; 0]$.

We will occasionally use tensorial notations, which
are briefly described next. We use the
covariant-contravariant summation convention: a point is an
object whose coordinates are specified with
superscripts, i.e., $p^i = (p^1, p^2, \ldots)$. These are called
contravariant vectors. An element in the dual space
(representing hyper-planes, i.e., lines in $\mathcal{P}^2$) is called
a covariant vector and is represented by subscripts,
i.e., $s_j = (s_1, s_2, \ldots)$. Indices repeated in covariant
and contravariant forms are summed over, i.e.,
$p^i s_i = p^1 s_1 + p^2 s_2 + \cdots + p^n s_n$. This is known as
a contraction. Vectors are also called 1-valence
tensors. 2-valence tensors (matrices) have two indices
and the transformation they represent depends on
the covariant-contravariant positioning of the indices.
When viewed as a matrix the row and column
positions are determined accordingly: in $a_i^j$ and $a_{ji}$, the
index i runs over the columns and j runs over the
rows, thus $b_j^k a_i^j = c_i^k$ is BA = C in matrix form. An
outer-product of two 1-valence tensors (vectors), $a_i b^j$,
is a 2-valence tensor $c_i^j$ whose i,j entries are $a_i b^j$;
note that in matrix form $C = b a^T$. An n-valence
tensor described as an outer-product of n vectors is a

rank-1 tensor. Any n-valence tensor can be described
as a sum of rank-1 n-valence tensors. The rank of
an n-valence tensor is the smallest number of rank-1
n-valence tensors with sum equal to the tensor. For
example, a rank-1 trivalent tensor is $a_i b_j c_k$ where $a_i$, $b_j$
and $c_k$ are three vectors. The rank of a trivalent tensor
$\alpha_{ijk}$ is the smallest r such that

$$\alpha_{ijk} = \sum_{s=1}^{r} a_{is}\, b_{js}\, c_{ks} \qquad (1)$$

The tensor of vector products, denoted by $\epsilon_{ijk}$
(indices range over 1-3), operates on two contravariant
vectors of the 2D projective plane and produces a
covariant vector in the dual space (a line):
$\epsilon_{ijk}\, p^i q^j = s_k$, which in vector form is
$s = p \times q$, i.e., s is the vector product
of the points p and q.
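
The contraction with the tensor of vector products can be written out directly. In this sketch, indices run over 0-2 rather than 1-3, and `levi_civita` and `cross_by_contraction` are names we introduce for illustration.

```python
def levi_civita(i, j, k):
    """Permutation symbol for indices in {0, 1, 2}: +1, -1, or 0."""
    return (i - j) * (j - k) * (k - i) // 2

def cross_by_contraction(p, q):
    """s_k = eps_ijk p^i q^j, the line through the points p and q."""
    return [sum(levi_civita(i, j, k) * p[i] * q[j]
                for i in range(3) for j in range(3))
            for k in range(3)]

def contract(point, line):
    """p^i s_i: zero exactly when the point lies on the line."""
    return sum(a * b for a, b in zip(point, line))

p = [1, 0, 1]   # homogeneous image point (1, 0)
q = [0, 1, 1]   # homogeneous image point (0, 1)
s = cross_by_contraction(p, q)
```

Here `s` is the covariant vector of the line joining p and q, so contracting it with either point gives zero.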

Two views $p \cong [I; 0]x$ and $p' \cong Ax$ are known to
produce a bilinear matching constraint whose
coefficients are arranged in a 3 x 3 matrix F known as the
“Fundamental matrix” of [2], described in the setting
of projective geometry (uncalibrated cameras):

$$f_{ij} = \epsilon_{ikl}\, v'^k a_j^l \qquad (2)$$

where $A = [a; v']$ ($a$ is the left 3x3 minor of A, and $v'$
is the fourth column, the epipole, of A). The bilinear
constraint is $f_{ij}\, p^j p'^i = 0$.
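
The bilinear constraint can be checked numerically. The sketch below builds F from a camera $[a; v']$ by the contraction $f_{ij} = \epsilon_{ikl} v'^k a^l_j$ and evaluates the constraint for a synthetic point; the camera entries, the 3-D point, and the helper names are hypothetical values chosen for illustration, and the index placement is our reading of eqn. 2.

```python
def levi_civita(i, j, k):
    """Permutation symbol for indices in {0, 1, 2}."""
    return (i - j) * (j - k) * (k - i) // 2

def fundamental(a, v):
    """f_ij = eps_ikl v^k a^l_j, i.e. F = [v]_x a in matrix form."""
    return [[sum(levi_civita(i, k, l) * v[k] * a[l][j]
                 for k in range(3) for l in range(3))
             for j in range(3)]
            for i in range(3)]

def project(a, v, X):
    """Image of the homogeneous 3-D point X under the camera [a; v]."""
    return [sum(a[r][c] * X[c] for c in range(3)) + v[r] * X[3]
            for r in range(3)]

a = [[1, 2, 0], [0, 1, 3], [1, 0, 1]]   # left 3x3 minor of A
v = [1, -2, 3]                          # fourth column (epipole)
X = [2, 1, 4, 1]                        # homogeneous 3-D point

p = X[:3]                 # first view uses the camera [I; 0]
p2 = project(a, v, X)     # second view
F = fundamental(a, v)
residual = sum(F[i][j] * p2[i] * p[j] for i in range(3) for j in range(3))
```

With integer inputs the residual is exactly zero, because $p'$ is a combination of $a p$ and $v'$, and both are orthogonal to $v' \times a p$.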

Three views, $p \cong [I; 0]x$, $p' \cong Ax$ and $p'' \cong Bx$,
are known to produce four trilinear forms whose
coefficients are arranged in a tensor representing a bilinear
function of the camera matrices A, B:

$$\alpha_i^{jk} = v'^j b_i^k - v''^k a_i^j \qquad (3)$$

where $B = [b; v'']$. The four trilinear constraints are:

$$p^i s_j^\mu r_k^\rho\, \alpha_i^{jk} = 0 \qquad (4)$$

where $s_j^\mu$ are any two lines ($s_j^1$ and $s_j^2$) intersecting
at p', and $r_k^\rho$ are any two lines intersecting at p''. Since
the free indices are μ, ρ, each in the range 1,2, we have
4 trilinear equations (which are unique up to linear
combinations). By changing the order of the views one
can obtain at most 12 trilinear constraints arranged
in three such tensors. These constraints first became
prominent in [8] and the underlying theory has been
studied intensively since in [11, 5, 4].
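
The four trilinear constraints can be verified on synthetic data. The sketch assumes the tensor is built as $\alpha_i^{jk} = v'^j b^k_i - v''^k a^j_i$ (eqn. 3); the cameras, the 3-D point, and the auxiliary points used to construct lines are arbitrary integers chosen for illustration, and all helper names are ours.

```python
def cross(u, w):
    return [u[1] * w[2] - u[2] * w[1],
            u[2] * w[0] - u[0] * w[2],
            u[0] * w[1] - u[1] * w[0]]

def project(m, v, X):
    """Image of the homogeneous 3-D point X under the camera [m; v]."""
    return [sum(m[r][c] * X[c] for c in range(3)) + v[r] * X[3]
            for r in range(3)]

def trilinear_tensor(a, v1, b, v2):
    """alpha[i][j][k] = v1^j b^k_i - v2^k a^j_i for cameras [a; v1], [b; v2]."""
    return [[[v1[j] * b[k][i] - v2[k] * a[j][i] for k in range(3)]
             for j in range(3)]
            for i in range(3)]

a, v1 = [[1, 2, 0], [0, 1, 3], [1, 0, 1]], [1, -2, 3]
b, v2 = [[2, 0, 1], [1, 1, 0], [0, 3, 1]], [4, 1, -1]
X = [2, 1, 4, 1]

p = X[:3]                     # view 1, camera [I; 0]
p1 = project(a, v1, X)        # view 2
p2 = project(b, v2, X)        # view 3
alpha = trilinear_tensor(a, v1, b, v2)

# Two lines through p' and two through p'', each joining the image
# point to an arbitrary second point.
s_lines = [cross(p1, [1, 0, 0]), cross(p1, [0, 1, 1])]
r_lines = [cross(p2, [0, 0, 1]), cross(p2, [1, 1, 0])]

residuals = [sum(p[i] * s[j] * r[k] * alpha[i][j][k]
                 for i in range(3) for j in range(3) for k in range(3))
             for s in s_lines for r in r_lines]
```

All four residuals vanish exactly for a consistent triplet of matching points, one residual per choice of the free indices μ and ρ.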

The elements of $\alpha_i^{jk}$ satisfy certain properties. The
algebraic relations among the elements are described
in [4], and contraction properties in [11]. Among the
contraction properties it will be useful later to mention
that $\delta_k \alpha_i^{jk}$ (for any vector $\delta$) produces a 2D
projective transformation (a homography) from image 1
to 2 via some plane of reference, and $\delta_j \alpha_i^{jk}$ produces a
homography matrix from image 1 to 3 via some plane.
These homography matrices can be used to recover the
fundamental matrix or the epipole. Finally, it has
been recently shown in [9] that the rank of $\alpha_i^{jk}$ is 4
(we will return to tensor ranks later).

3  The  Basic  Tensorial  Operators 

The basic operator presented next describes the transformation that a
tensor of three views undergoes when one of the cameras changes its
position. In other words, this operator can be used to create a chain
of tensors, each created from its predecessor in the chain. Later, we
will start from a Null tensor (all three views are repeated) and create
a chain that along the way creates a representation of two views as a
rank-2 trivalent tensor.


Theorem 1 Given a tensor $\alpha_i^{jk}$ of camera positions $[I;0]$,
$A$, $B$, the tensor $\mathcal{G}_i^{jk}$ of camera positions $[I;0]$,
$A$, $C$, where $C$ is obtained from $B$ by an incremental change of
coordinates $R$ and translation $t$ from the position of the third
camera, has the form:

$$\mathcal{G}_i^{jk} = r_l^k \alpha_i^{jl} - t^k a_i^j \qquad (5)$$

Proof: The proof is beyond the scope of this paper.
Please refer to [1]. []

Likewise, if we apply an incremental change to the position of the
second camera, rather than the third, then the tensor
$\mathcal{G}_i^{jk}$ will have the form:

$$\mathcal{G}_i^{jk} = r_l^j \alpha_i^{lk} + t^j b_i^k \qquad (6)$$
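The two operators can be checked against a direct construction of the tensor. The sketch below reads eq. 5 as $\mathcal{G}_i^{jk} = r_l^k \alpha_i^{jl} - t^k a_i^j$ and eq. 6 as $\mathcal{G}_i^{jk} = r_l^j \alpha_i^{lk} + t^j b_i^k$, and assumes that an incremental motion $(R, t)$ turns a camera $[m; v]$ into $[Rm; Rv + t]$. That convention, the function names and the random data are assumptions made for this illustration.

```python
import numpy as np

def trilinear_tensor(A, B):
    """alpha[i, j, k] = v'^j b_i^k - v''^k a_i^j for cameras [I;0], A, B."""
    a, v1 = A[:, :3], A[:, 3]
    b, v2 = B[:, :3], B[:, 3]
    return np.einsum('j,ki->ijk', v1, b) - np.einsum('k,ji->ijk', v2, a)

def move_third_view(alpha, a, R, t):
    """Eq. 5 as read here: G_i^{jk} = r^k_l alpha_i^{jl} - t^k a_i^j."""
    return np.einsum('kl,ijl->ijk', R, alpha) - np.einsum('k,ji->ijk', t, a)

def move_second_view(alpha, b, R, t):
    """Eq. 6 as read here: G_i^{jk} = r^j_l alpha_i^{lk} + t^j b_i^k."""
    return np.einsum('jl,ilk->ijk', R, alpha) + np.einsum('j,ki->ijk', t, b)

def apply_motion(M, R, t):
    """Assumed convention: camera [m; v] moved by (R, t) is [R m; R v + t]."""
    return np.hstack([R @ M[:, :3], (R @ M[:, 3] + t)[:, None]])

rng = np.random.default_rng(2)
A, B = rng.standard_normal((2, 3, 4))
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R *= np.sign(np.linalg.det(R))               # make it a proper rotation
t = rng.standard_normal(3)

alpha = trilinear_tensor(A, B)
via_eq5 = move_third_view(alpha, A[:, :3], R, t)
direct_third = trilinear_tensor(A, apply_motion(B, R, t))
via_eq6 = move_second_view(alpha, B[:, :3], R, t)
direct_second = trilinear_tensor(apply_motion(A, R, t), B)
```

Under these assumptions the operator output and the directly built tensor agree element by element, which is what lets a chain of tensors be produced without revisiting the camera matrices of the moved view.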


4  The  rank-2  Trivalent  Tensor  of  Two 
Views 

We will use the tensorial operators (eqns. 5 and 6) to create a chain
starting from the (Null) tensor of views <1,1,1> (all three views are
repeated), to the tensor of views <1,2,1>, to the tensor of views
<1,2,2> and finally to the tensor of views <1,2,3>. All the tensors of
the chain are trivalent tensors, and of interest are the tensors that
represent only two distinct views.

The tensor of views <1,2,1> can be derived from the Null tensor using
eqn. 6,

$$\mathcal{G}_i^{jk} = a_l^j\, 0_i^{lk} + v'^j I_i^k = v'^j I_i^k \qquad (7)$$

by the incremental motion $A = [a, v']$ from views <1,1,1> to views
<1,2,1>. The elements of the tensor are either 0 or the epipole $v'$.


Next, we apply an incremental motion of the third view, going from the
tensor of views <1,2,1> to the tensor of views <1,2,2>. The incremental
motion is again $A = [a, v']$ and we use the operator described in
eqn. 5 to obtain:

$$\mathcal{G}_i^{jk} = a_l^k (v'^j I_i^l) - v'^k a_i^j \qquad (8)$$

and $\mathcal{G}_i^{jk}$ is the tensor of the image triplet <1,2,2>.
It can be readily verified that the elements of $\mathcal{G}_i^{jk}$
are composed of the elements of the fundamental matrix
$f_{ij} = \epsilon_{ikl}\, v'^k a^l_j$, their negatives $-f_{ij}$, and
the remaining (nine) elements vanish. In other words, we have derived a
trivalent tensor that represents the geometry of two views and is
composed of the elements of the fundamental matrix. The following
theorems and corollary are proved in [1].
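The chain from the Null tensor to the tensor of views <1,2,2> can be reproduced numerically. The sketch below assumes the operator forms $r_l^j \alpha_i^{lk} + t^j b_i^k$ (second view moved) and $r_l^k \alpha_i^{jl} - t^k a_i^j$ (third view moved), with the incremental motion $A = [a, v']$ in both steps. The function names and the random camera are illustrative assumptions.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def move_second_view(T, b, R, t):
    """Second view moved: r^j_l T_i^{lk} + t^j b_i^k."""
    return np.einsum('jl,ilk->ijk', R, T) + np.einsum('j,ki->ijk', t, b)

def move_third_view(T, a, R, t):
    """Third view moved: r^k_l T_i^{jl} - t^k a_i^j."""
    return np.einsum('kl,ijl->ijk', R, T) - np.einsum('k,ji->ijk', t, a)

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4))              # hypothetical second camera
a, v1 = A[:, :3], A[:, 3]
I3 = np.eye(3)

null = np.zeros((3, 3, 3))                   # views <1,1,1>
t121 = move_second_view(null, I3, a, v1)     # views <1,2,1>
t122 = move_third_view(t121, a, a, v1)       # views <1,2,2>
F = skew(v1) @ a                             # fundamental matrix [v']_x a
```

With this index layout, `t122[i, 0, 1]` equals `F[2, i]`, `t122[i, 1, 2]` equals `F[0, i]`, `t122[i, 2, 0]` equals `F[1, i]`, the transposed slices carry the negated values, and the nine entries with `j == k` are zero.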

Theorem 2 The rank of the trivalent tensor of two views,
$\mathcal{G}_i^{jk}$, is 2.

Theorem 3 The tensor $\mathcal{G}_i^{jk}$ shares the same properties
as the general rank-4 tensor of three views.

A  byproduct  of  these  properties  is  that  we  can  char¬ 
acterize  the  family  of  rank-2  homography  matrices: 

Corollary 1 $[c]_\times F$, where $c = (c_1, c_2, c_3)$ is a general
3-vector, defines a family of homography matrices from the first image
to the second image due to a plane passing through the center of
projection of the second camera.

The corollary extends the result of [6] that $[v']_\times F$ is a
homography matrix to a family of homography matrices $[c]_\times F$
due to planes passing through the center of projection of the second
camera.

Finally, note that the bilinear constraint follows from
$\mathcal{G}_i^{jk}$ in the same manner as in the general rank-4
tensor: $p^i s_j r_k \mathcal{G}_i^{jk} = 0$ describes a contraction
with the point $p$ in the first view, some line $s$ passing through
$p'$ and some other line $r$ passing through $p'$ as well. Thus, we get
the same point-line-line interpretation we get with the general rank-4
tensor.

The last step in the chain goes from the tensor of views <1,2,2> to
the general tensor of views <1,2,3>. This can be readily done using
the operator of eqn. 5.

To conclude, we have shown the basic “building block” of stereo vision
to be the trilinear tensor of three cameras. Every other object, be it
the epipole or the fundamental matrix, is merely a degenerate case of
the general trilinear tensor. Since camera parameters can be recovered
directly from the trilinear tensor, there is no need for the
fundamental matrix, other than to serve as a tool for constructing the
rank-2 trivalent tensor in case only two, rather than three, views are
given. As a result, algorithms developed under the three-view paradigm
will apply to all camera configurations, be it two or three cameras.

5  Applications 

This section presents two applications to highlight two of the ideas
advocated in this paper. The first example highlights the simple and
uniform way to treat tensors (both rank-4 and rank-2) in order to
obtain new ones. There is no need to distinguish between the geometry
of two and three views. Specifically, we present an image-based
rendering algorithm that starts with a pair of images, related by a
rank-2 tensor, and generates a novel view by seamlessly moving from the
rank-2 tensor to the rank-4 tensor. The second application demonstrates
the generality of algorithms developed in the tensor context: they act
the same for the case of two views and for the case of three views. We
show this on a stabilization algorithm originally developed in the
three-view framework.

5.1  Novel  View  Synthesis 

Novel view synthesis, also referred to as image-based rendering, aims
at synthesizing novel views of a scene from a given pair of images,
without first reconstructing the 3D model. This method can be faster
and more accurate to compute than building the 3D model first. The
trilinear tensor is an ideal candidate for an image-based rendering
system, as it is numerically stable and has no degenerate
configurations. We use the basic tensor operators described earlier and
the rank-2 trivalent tensor of two views to build new tensors. Once a
tensor is built we use equation 4 to reproject the novel image.
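One standard way to use the trilinear constraints for reprojection is point transfer: contracting the tensor with a point $p$ of the first image and any line $s$ through its match $p'$ yields the corresponding point of the third view up to scale, $p''^k \cong p^i s_j \alpha_i^{jk}$. The sketch below is an illustration under that reading; the text does not spell out its reprojection step, and the function names, the particular line chosen through $p'$ and the random cameras are assumptions.

```python
import numpy as np

def trilinear_tensor(A, B):
    """alpha[i, j, k] = v'^j b_i^k - v''^k a_i^j for cameras [I;0], A, B."""
    a, v1 = A[:, :3], A[:, 3]
    b, v2 = B[:, :3], B[:, 3]
    return np.einsum('j,ki->ijk', v1, b) - np.einsum('k,ji->ijk', v2, a)

def transfer_point(alpha, p, p1):
    """Reproject into view 3: p''^k ~ p^i s_j alpha_i^{jk}, s a line through p'."""
    s = np.array([p1[2], 0.0, -p1[0]])       # one line through p1 (s @ p1 == 0)
    return np.einsum('i,j,ijk->k', p, s, alpha)

rng = np.random.default_rng(4)
A, B = rng.standard_normal((2, 3, 4))        # second view and novel view
X = np.append(rng.standard_normal(3), 1.0)
p, p1, p2_true = X[:3], A @ X, B @ X
alpha = trilinear_tensor(A, B)
p2 = transfer_point(alpha, p, p1)            # equals p2_true up to scale
```

The transferred point degenerates to the zero vector only when the chosen line $s$ passes through the epipole $v'$, in which case a different line through $p'$ should be used.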

5.2  Video  Stabilization 

This application illustrates the “cross-platform” capability of
three-view algorithms. As an example we show how to convert a
three-view stabilization algorithm, originally presented in [7], to
work with two images only. The purpose of the stabilization algorithm
is to cancel the rotation between successive frames. The original paper
makes use of the fact that the tensor is composed of three homography
matrices to establish a linear relation between the elements of the
trilinear tensor and those of the rotation matrix, in case the cameras
are calibrated and the angles are small. Since both the rank-2 and
rank-4 tensors can be decomposed into three homography matrices, the
algorithm works the same for two views and three views. Figure 2 shows
the two input images as well as an average image of the original images
and an average image of the two images after rotation cancellation. For
verification we compared our results with the three-view algorithm, by
adding a third image (not shown here). The recovered rotation angles
differed by less than 0.01 radians. The visual result was
indistinguishable.

Figure 1: An example of image synthesis using optic-flow and a tensor.
The original images are (a) and (b); the rest of the images are
synthesized using the method described in the paper.

Figure 2: The original two images ((a),(b)). Average of the original
two images (c). Average of the two images after rotation
cancellation (d).

6  Conclusion 

We unified two-view, three-view and, as a result, multi-view geometry
with the trilinear tensor as the connecting thread. This was done by
developing a basic tensorial operator that describes the change in the
tensor elements as a result of camera motion and using it to create a
chain of tensors that includes the epipole, the fundamental matrix (as
a rank-2 trivalent tensor), and the rank-4 trilinear tensor in a single
framework. The rank-2 tensor of two views and the rank-4 tensor of
three views share the same properties and are governed by a single set
of basic tensorial operators. As a result, algorithms developed under
the three-view paradigm will apply to all camera configurations, be it
two or three cameras. Apart from the theoretical result, we showed two
practical examples that make use of this theory: an image-based
rendering application that uses the basic tensorial operators to
seamlessly move from the rank-2 tensor (representing the geometry of
two views) to rank-4 tensors (representing the geometry of three
views), and an image-stabilization algorithm that works unchanged for
two or three images.
Abstract

In  this  paper,  the  problem  of  recovering  three- 
dimensional  information  from  multiple  images 
is  considered.  The  goal  is  to  build  a  system 
that  can  incrementally  process  images  acquired 
from  arbitrary  camera  positions.  Our  approach 
makes  use  of  both  the  geometric  constraints  in¬ 
herent  in  the  camera  configuration,  as  well  as 
the  structural  relationships  between  image  fea¬ 
tures.  The  correspondence  problem  is  analyzed 
directly  in  3D  through  multi-image  triangula¬ 
tion.  To  address  the  possibilities  of  false  fea¬ 
tures  and  spurious  correspondence,  every  initial 
match  is  modeled  as  a  hypothesis.  At  the  core 
of  our  system  is  a  state  machine  which  keeps 
track  of  matching  hypotheses  in  various  states 
of  certainty,  and  evolves  their  states  in  response 
to  new  evidence. 

1  Introduction 

Traditional multi-image reconstruction systems, both stereo-based and
motion-based, assume that successive images are taken from very similar
viewpoints and are processed in the order of acquisition. This
assumption makes the matching of features in successive images robust,
since the appearance of the features changes very little. However,
enforcing this assumption makes it difficult to achieve the precision
possible by matching features in views of the same scene taken from
distant viewpoints at arbitrary times. Moreover, from an engineering
point of view, it is very limiting to assume that the thousands of
images required for a large-scale reconstruction project would be
gathered as a nearly-continuous image stream.

*Funding for this research was provided in part by the Advanced
Research Projects Agency of the Department of Defense under
Ft. Huachuca contract DABT63-95-C-0009.

Our  goal  is  to  be  able  to  process  images  in  arbi¬ 
trary  order,  without  placing  assumptions  on  the 
order  of  acquisition  or  strong  constraints  on  the 
viewpoint.  We  exploit  two  other  sources  of  con¬ 
straints  to  reduce  the  feature  matching  ambi¬ 
guity  introduced  by  our  attempt  to  achieve  this 
goal.  First,  we  use  a  combination  of  sensors, 
including  GPS,  to  annotate  each  image  with  an 
estimate  of  its  acquiring  camera’s  position  and 
orientation.  Therefore,  we  can  estimate,  in  a  co¬ 
ordinate  system  common  to  all  images,  the  3D 
ray  where  a  feature  in  one  camera  view  lies.  Sec¬ 
ond,  we  limit  ourselves  to  urban  environments, 
in  which  there  is  an  abundance  of  line  and  ver¬ 
tex  features  that  are  relatively  easy  to  identify 
from  a  wide  range  of  viewpoints. 

Given these considerations, we have designed a method for the recovery
of 3D structure from multiple images of an urban scene. The algorithm
operates by establishing long-baseline correspondences between 3D
features in distant images. However, just as in 2D, spurious matches
can occur in 3D. The occurrence of false matches can be significantly
reduced by supplementing the geometric constraints of the imaging
configuration (the camera pose estimates) with knowledge about the
structural relationships of image features (vertex and edge
relationships). Still, feature detection is by no means a perfect
process. Each matching hypothesis must be supported by a sufficient
number of consistent observations before it can be confirmed. A state
machine has been developed to keep track of the hypotheses and their
estimated uncertainties.

2  Previous  Work 

Over  the  years,  stereo  researchers  have  explored 
countless  ways  to  improve  the  performance  of 
stereo  algorithms.  A  primary  objective  is  to 
establish  reliable  correspondence  across  two  or 
more  images.  The  challenge  is  that  when  a  large 
area  must  be  searched  for  a  match,  the  potential 
for  spurious  matches  increases  also. 

In response to this problem, researchers first turned to coarse-to-fine
methods [Grimson, 1981], [Terzopoulos, 1983]. In these systems,
matching begins at a low resolution of the image in order to cover
large displacements. Matching then proceeds to higher resolutions,
where results from lower resolutions are used to constrain the search.
This class of method cannot deal with the significant perspective
distortion and occlusion present in long baseline images.

Other  researchers  advocated  using  multiple  im¬ 
ages  acquired  with  closely  spaced  cameras  as  a 
way  of  extending  the  baseline  of  analysis  while 
minimizing  false  matches  [Herman  and  Kanade, 
1986],  [Baker  and  Bolles,  1989].  By  exploiting 
the  temporal  coherence  of  very  short  baseline 
images,  stereo  correspondence  can  be  performed 
accurately  through  incremental  tracking  of  pix¬ 
els  or  features.  Although  these  methods  seem 
to  work  well,  they  are  dependent  on  the  tem¬ 
poral  coherence  of  the  input  for  reliable  feature 
tracking.  They  cannot,  for  example,  associate 
images  which  are  taken  at  very  different  times, 
but  which  contain  observations  of  identical  real- 
world  structures. 

Another approach is to utilize the structural relationship between
image features to resolve matches [Lim and Binford, 1988], [Horaud and
Skordas, 1989]. It has been observed that structural properties tend to
be more invariant with respect to viewing changes than local
image/feature properties. The problem of correspondence then becomes a
problem of finding the mapping which best preserves the structural
relationship. Because these methods often assume that their feature
extraction process is ideal, they tend to be fragile with real images.

Figure 1: Multi-image triangulation

Recently, new algorithms capable of analyzing long baseline inputs have
been proposed. Bedekar and Haralick [1996] describe a method for
Bayesian triangulation and hypothesis testing. A major drawback of
their work is that they do not consider the possibility of spurious
matches. Collins [1996] presents a space-sweep approach to multi-image
matching. The problem with this method is that it uses a constant
threshold for rejecting false matches, and so does not handle
underlying causal factors in a generic fashion.

3  Multi-Image  Triangulation 

The basic principle underlying the recovery of three-dimensional
information from two-dimensional images is triangulation. Suppose we
are given the corresponding image positions of a 3D point $\mathbf{x}$
projected onto a set of images $I_i$. We can compute the 3D position of
the point by finding the intersection of rays projected, respectively,
from camera $C_i$ and passing through the image feature $m_i$
(Figure 1).

Typically  the  rays  will  not  intersect  precisely  at 
one  point.  However,  a  well-fitting  point  x  can  be 
estimated  with  the  least  squares  method.  Our 
goal  is  to  minimize  the  sum  of  squared  distances 
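of the rays to point $\mathbf{x}$:

$$D(\mathbf{x}) = \sum_i (\mathbf{a}_i t_i + \mathbf{b}_i - \mathbf{x})^T (\mathbf{a}_i t_i + \mathbf{b}_i - \mathbf{x}) \qquad (1)$$

where $\mathbf{a}_i$ is the direction of the ray $i$, and
$\mathbf{b}_i$ is an arbitrary point on the ray (usually taken to be
the camera position $C_i$).

Setting $dD(\mathbf{x})/d\mathbf{x} = 0$, we get

$$\sum_i (\mathbf{a}_i \mathbf{a}_i^T - I)^T (\mathbf{a}_i \mathbf{a}_i^T - I)\, \mathbf{x} = \sum_i (\mathbf{a}_i \mathbf{a}_i^T - I)^T (\mathbf{a}_i \mathbf{a}_i^T - I)\, \mathbf{b}_i \qquad (2)$$

We note that this is a linear system, $A\mathbf{x} = \mathbf{b}$.
Using singular value decomposition the matrix
$\sum_i (\mathbf{a}_i \mathbf{a}_i^T - I)^T (\mathbf{a}_i \mathbf{a}_i^T - I)$
can be decomposed into

$$\sum_i (\mathbf{a}_i \mathbf{a}_i^T - I)^T (\mathbf{a}_i \mathbf{a}_i^T - I) = U W U^T \qquad (3)$$

where $U$ is an orthonormal matrix satisfying $U^T U = I$ and $W$ is a
diagonal matrix containing the singular values. The least squares
estimate $\mathbf{x}$ is then

$$\mathbf{x} = U W^{-1} U^T \left( \sum_i (\mathbf{a}_i \mathbf{a}_i^T - I)^T (\mathbf{a}_i \mathbf{a}_i^T - I)\, \mathbf{b}_i \right) \qquad (4)$$

The residual of the intersection process is given by $D(\mathbf{x})$
in Equation (1).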
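The least squares intersection can be sketched in a few lines. The sketch assumes unit ray directions (it normalizes them), takes $t_i$ to be the position along ray $i$ of the point closest to $\mathbf{x}$, so that each term of $D(\mathbf{x})$ becomes $\|(\mathbf{a}_i \mathbf{a}_i^T - I)(\mathbf{x} - \mathbf{b}_i)\|^2$, and needs at least two non-parallel rays for the system to be invertible. The function name and the test rays are illustrative assumptions.

```python
import numpy as np

def triangulate(directions, origins):
    """Least squares intersection of rays b_i + t * a_i.

    Returns the point x and the residual D(x), the sum of squared
    point-to-ray distances.
    """
    M = np.zeros((3, 3))
    rhs = np.zeros(3)
    terms = []
    for a, b in zip(directions, origins):
        a = np.asarray(a, float) / np.linalg.norm(a)
        b = np.asarray(b, float)
        P = np.outer(a, a) - np.eye(3)       # (a a^T - I)
        Q = P.T @ P
        M += Q                               # left-hand side of the system
        rhs += Q @ b                         # right-hand side of the system
        terms.append((P, b))
    U, w, Vt = np.linalg.svd(M)              # M is symmetric, so V equals U
    x = Vt.T @ ((U.T @ rhs) / w)             # inverse applied through the SVD
    residual = sum(float(np.sum((P @ (x - b)) ** 2)) for P, b in terms)
    return x, residual

target = np.array([1.0, 2.0, 10.0])
cameras = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 1.0]])
x_hat, res = triangulate(target - cameras, cameras)
```

For noise-free rays that all pass through `target`, `x_hat` reproduces the point and the residual is zero up to rounding; with noisy rays the residual is the quantity compared against a threshold when a correspondence hypothesis is tested.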

4  Matching  via  Triangulation 

Suppose  we  hypothesize  that  a  set  of  image  fea¬ 
tures  is  in  correspondence.  A  direct  method  for 
testing  the  hypothesis  would  be  to  apply  multi¬ 
image  triangulation  on  the  established  feature 
set,  and  examine  the  residual  of  the  least  square 
computation.  If  the  residual  is  greater  than  a 
certain  threshold,  there  is  no  single  3D  point 
near  which  all  of  the  rays  pass,  and  the  corre¬ 
spondence  hypothesis  can  be  rejected. 

However, we cannot hastily accept any intersection of rays as a match.
Figure 2 illustrates a case in point. In the figure, the rays of
vertices $a_1$ and $a_2$ intersect with the rays of vertices $b_1$ and
$b_2$ by accident. By themselves, the accidental intersections could be
interpreted as a line floating in front of two buildings. This is
clearly incorrect, and situations like this are not uncommon. Whenever
multiple images are taken with a camera revolving around some region in
space, there will be many nearly-crossing rays.

Figure 2: Two spurious matches

Additional observation of the features is needed to resolve this
ambiguity. Certain structural properties, for example adjacency, are
invariant with respect to large changes in viewing direction.
Connectivity of vertices is a useful structural property in this
regard. For two vertices to match, we require that at least two of
their incident edges must match also. In the case of Figure 2, we test
whether two incident edges of $a_1$ match with two incident edges of
$b_1$, and can quickly reject this configuration as a spurious
intersection. To follow this strategy, we need to determine vertex
connectivity information from the images.

Unfortunately,  no  existing  feature  extraction  al¬ 
gorithm  is  perfect.  We  may  never  be  certain 
that  a  feature  detected  from  an  image  is  not  an 
artifact  of  the  extraction  process.  For  instance, 
occlusion  often  generates  incidental  features  like 
T-junctions.  Since  T-junctions  are  not  intrinsic 
to  any  real  3D  object,  their  presence  can  con¬ 
fuse  the  matching  process.  To  deal  with  these 
problems,  we  treat  matches  as  explicit  hypothe¬ 
ses  that  must  later  be  confirmed  by  additional 
matches.  We  also  estimate  our  level  of  certainty 
in  each  hypothesis  and  update  it  as  new  evi¬ 
dence  becomes  available. 
4.1 The Data Structures

We  list  here  five  types  of  data  structures  that 
are  relevant  to  our  algorithm.  The  first  two  are 
image  features,  and  the  last  three  are  matching 
hypotheses  in  increasing  states  of  certainty. 

•  2D  lines  -  are  extracted  by  fitting  lines  to 
the  output  of  an  edge  detector.  The  system 
constructs  2D  lines  only  for  the  purpose  of 
vertex  detection. 

•  2D  vertices  -  are  located  by  intersecting  2D 
lines  that  form  an  L-junction.  Vertices  are 
the  key  features  used  in  the  correspondence 
process.  Each  vertex  is  described  by:  1)  a 
label,  2)  an  image  position,  3)  the  number 
of  incident  lines,  and  4)  any  connected  ad¬ 
jacent  vertices. 

•  2D  hypothesis  -  is  a  list  of  matched  2D  ver¬ 
tices  with  a  combined  baseline  too  short  to 
produce  a  reliable  3D  estimate.  Each  2D 
hypothesis  is  described  by:  1)  a  label,  2) 
the  number  of  contributing  vertices,  3)  a 
list  of  matched  2D  vertices,  and  4)  baseline 
information. 

•  3D  hypothesis  -  is  a  list  of  matched  2D  ver¬ 
tices  with  a  3D  estimate,  but  a  number  of 
observations  insufficient  for  confirmation  as 
a  3D  model.  Each  3D  hypothesis  is  de¬ 
scribed  in  the  same  way  as  a  2D  hypothe¬ 
sis,  with  two  elements  of  additional  infor¬ 
mation:  5)  a  3D  estimate  of  the  feature’s 
position,  and  6)  an  estimate  of  the  recon¬ 
struction  error  in  3D. 

• 3D element - is a confirmed 3D hypothesis. Each 3D element is
described in exactly the same way as a 3D hypothesis.
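The records listed above could be represented along the following lines. This is only a sketch: the class and field names are hypothetical, 2D lines are left out because they serve only vertex detection, and the three kinds of hypothesis share one record since they are described by the same fields (the 3D estimate and its error stay empty for a 2D hypothesis).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Vertex2D:
    label: str
    position: Tuple[float, float]            # image position
    num_incident_lines: int
    adjacent: List[str] = field(default_factory=list)  # labels of connected vertices

@dataclass
class Hypothesis:
    label: str
    kind: str                                # "2D hypothesis", "3D hypothesis" or "3D element"
    vertices: List[Vertex2D] = field(default_factory=list)  # matched 2D vertices
    baseline: float = 0.0                    # baseline information
    position_3d: Optional[Tuple[float, float, float]] = None  # 3D estimate
    error_3d: Optional[float] = None         # estimated reconstruction error

    @property
    def num_vertices(self) -> int:
        """Number of contributing vertices."""
        return len(self.vertices)

corner = Vertex2D(label="v17", position=(120.5, 64.0), num_incident_lines=2)
hyp = Hypothesis(label="h3", kind="2D hypothesis", vertices=[corner], baseline=0.8)
```

Keeping a single record type means that promoting a hypothesis only changes `kind` and fills in the 3D fields, which mirrors the statement that a 3D element is described exactly like a 3D hypothesis.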

4.2  The  Matching  Algorithm 

The  algorithm  maintains  a  set  of  hypotheses, 
and  evolves  the  state  of  each  hypothesis  after 
insertion  of  a  new  image. 

After  the  features  (lines  and  vertices)  of  a  new 
image  have  been  extracted,  the  algorithm  tries 
to  find  confirming  evidence  for  existing  hypothe¬ 
ses  among  any  newly  observed  2D  features.  The 


algorithm  first  attempts  to  reduce  reconstruc¬ 
tion  error  for  any  3D  element  for  which  a  new 
observation  is  found.  Next,  hypotheses  are  pro¬ 
cessed  in  order  of  most  to  least  evolved,  be¬ 
ginning  with  3D  hypotheses,  then  2D  hypothe¬ 
ses,  and  finally  unmatched  2D  vertices.  For 
each,  confirmatory  evidence  is  sought  among 
any  newly  identified  features. 

For  every  existing  element/hypothesis/vertex: 

1.  For  every  new  vertex,  we  project  a  ray  from 
the  new  camera  position  through  the  new 
vertex. 

2.  For  3D  elements  and  3D  hypotheses,  we 
find  the  shortest  distance  between  the  ray 
and  the  estimated  3D  position  of  the  el¬ 
ement/hypothesis.  If  this  distance  is  suffi¬ 
ciently  small,  we  check  to  see  if  at  least  two 
incident  edges  of  the  element/hypothesis 
match  edges  incident  to  the  new  vertex. 

3.  For  2D  hypotheses  and  unmatched  vertices, 
we  find  the  residual  resulting  from  inter¬ 
secting  this  ray  with  rays  from  all  matched 
vertices  in  the  2D  hypothesis,  or  the  sin¬ 
gle  ray  of  the  unmatched  vertex.  If  this 
residual  is  sufficiently  small,  we  check  for 
matching  adjacent  edges  as  in  Step  2. 

4. We link the current element, hypothesis or vertex with the new
vertex that gives the best match.
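The geometric part of Steps 1, 2 and 4 can be sketched as a point-to-ray distance followed by a gated nearest-ray search. The function names and the `max_distance` threshold are hypothetical, the ray is treated as an infinite line, and the incident-edge test of Step 2 is deliberately left out.

```python
import numpy as np

def ray_point_distance(origin, direction, x):
    """Shortest distance between the point x and the line origin + t * direction."""
    a = np.asarray(direction, float)
    a = a / np.linalg.norm(a)
    d = np.asarray(x, float) - np.asarray(origin, float)
    return float(np.linalg.norm(d - (d @ a) * a))

def best_match(position_3d, camera, ray_directions, max_distance):
    """Index of the new vertex whose ray passes closest to position_3d,
    or None when even the closest ray is farther than max_distance."""
    dists = [ray_point_distance(camera, a, position_3d) for a in ray_directions]
    i = int(np.argmin(dists))
    return i if dists[i] <= max_distance else None

camera = [0.0, 0.0, 0.0]                     # new camera position
rays = [[1.0, 0.0, 0.0], [0.0, 0.01, 1.0], [0.0, 1.0, 1.0]]  # one per new vertex
estimate = [0.0, 0.0, 10.0]                  # 3D position of an element/hypothesis
chosen = best_match(estimate, camera, rays, max_distance=0.5)
```

In a fuller implementation the candidate returned here would still have to pass the check that at least two incident edges match before the link of Step 4 is made.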

Each  new  vertex  can  be  matched  with  more 
than  one  hypothesized  object.  Thus,  a  spurious 
match  will  not  affect  other  objects  in  the  sys¬ 
tem.  After  each  new  match  is  identified,  we  test 
for  these  possible  state  transitions  (Figure  3): 

• 3D hypothesis -> 3D element
if the number of observations is sufficient.

• 2D hypothesis -> 3D element
if the baseline is long enough and the number of observations is
sufficient.

• 2D hypothesis -> 3D hypothesis
if the baseline is long enough.

• 2D feature -> 3D hypothesis
if the baseline is long enough.

• 2D feature -> 2D hypothesis
if the baseline is not long enough.

If a hypothesis lingers longer than permitted without confirmatory
evidence, it is “killed” or deleted from the set of active hypotheses.

Figure 3: State evolution diagram (legend: 2H = 2D hypothesis,
3H = 3D hypothesis, 3E = 3D element; the diagram also shows 2D features
and killed objects).

5 Image Insertion

Above, we specified the processing to be done for each newly inserted
image. Rather than insert the images in temporal order (the order in
which they were acquired), we process images in groups according to
whether they are suspected to have observed the same region of absolute
3D space (Figure 4). That is, given a set of images annotated with
estimates of 6-DOF pose, we fix our attention on a region of 3D space
(the dashed box in Figure 4), then identify those images possibly
containing observations of this region from a distance less than some
absolute threshold (typically, one hundred meters). In the figure, this
set of images is represented by bold wedges. These images are inserted
in arbitrary order and processed as described above, producing a
stateful set of feature hypotheses. The region of interest is then
moved; any 3D elements no longer in the region of interest are output,
and the set of relevant images is coherently updated to contain
observations of the new region of interest.

Figure 4: Images are associated spatially, not temporally.
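The life cycle that each hypothesis follows as the images of a group are inserted (the state transitions of Section 4.2) can be sketched as a small state machine. The enum, the function names, and the `images_waited` / `max_wait` bookkeeping for the "killed" rule are hypothetical; the text does not say how lingering is measured.

```python
from enum import Enum

class State(Enum):
    FEATURE_2D = "2D feature"
    HYPOTHESIS_2D = "2D hypothesis"
    HYPOTHESIS_3D = "3D hypothesis"
    ELEMENT_3D = "3D element"
    KILLED = "killed"

def after_match(state, baseline_long_enough, enough_observations):
    """State of an object after a new matching vertex has been linked to it."""
    if state in (State.ELEMENT_3D, State.KILLED):
        return state
    if state is State.HYPOTHESIS_3D:
        return State.ELEMENT_3D if enough_observations else state
    if not baseline_long_enough:
        return State.HYPOTHESIS_2D           # 2D feature or 2D hypothesis, short baseline
    if state is State.HYPOTHESIS_2D and enough_observations:
        return State.ELEMENT_3D
    return State.HYPOTHESIS_3D               # baseline long enough, still unconfirmed

def after_unmatched_image(state, images_waited, max_wait):
    """Kill a hypothesis that has lingered too long without new evidence."""
    waiting = state in (State.HYPOTHESIS_2D, State.HYPOTHESIS_3D)
    if waiting and images_waited > max_wait:
        return State.KILLED
    return state
```

For example, a 2D feature matched across a short baseline becomes a 2D hypothesis, and a later match that lengthens the baseline promotes it to a 3D hypothesis or, with enough observations, directly to a 3D element.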


6  Conclusion 

In this paper, we describe a method for matching images acquired from
arbitrary camera positions. Rather than processing images in temporal
order, we process images by grouping them according to the 3D regions
they observe. The method operates by hypothesizing 3D features, then
seeking confirmatory evidence for these features in successively
inserted images. This incremental approach seeks to evolve feature
hypotheses by amassing a sufficiently large number of observations
which agree on a feature’s position to within a sufficiently small
tolerance or reconstruction error.
Abstract

Both  the  original  version  of  David  Lowe’s  influ¬ 
ential  and  classic  algorithm  for  tracking  known 
objects  and  a  reformulation  of  it  implemented 
by  Ishii  et  al.  rely  on  (different)  approximated 
imaging  models.  Removing  their  simplifying 
assumptions  yields  a  full  projective  solution 
with  significantly  improved  accuracy  and  con¬ 
vergence,  and  arguably  better  computation¬ 
time  properties. 

1  Introduction  and  History 

The  ability  to  track  a  set  of  points  in  a  moving  im¬ 
age  plays  a  fundamental  role  in  several  computer  vi¬ 
sion  applications  with  real  time  constraints  such  as  au¬ 
tonomous  navigation,  surveillance,  grasping,  and  ma¬ 
nipulation.  Often  some  geometrical  invariants  of  these 
points  (such  as  their  relative  spatial  positions,  in  the 
case  of  a  rigid  object)  are  known  in  advance.  Algebraic 
solutions  with  perspective  camera  models  have  been  pro¬ 
posed  for  several  variations  of  this  problem  (references 
suppressed  here).  However,  the  resulting  techniques  usu¬ 
ally  work  only  with  a  limited  number  of  points  and  are 
thus  sensitive  to  additive  noise  and  erroneous  match¬ 
ing.  Furthermore,  they  usually  depend  on  numerical 
techniques  for  finding  zeros  of  fourth-  or  fifth-degree 
polynomial  equations. 

Pioneering  work  by  Lowe  [9;  8;  7]  and  Gennery  [5]  ad¬ 
dressed  the  problem  in  a  projective  framework.  Lowe 
showed  that  the  direct  use  of  numeric  optimization  tech¬ 
niques  is  an  effective  way  to  overcome  the  lack  of  robust¬ 
ness  that  makes  the  traditional  analytical  techniques  in¬ 
feasible  in  practice. 

DeMenthon and Davis [4; 10] and Christy and Horaud [3] propose
techniques that start with weak- or para-perspective solutions,
respectively, and refine them iteratively to recover the full
perspective pose. Phong, Horaud et al. [11] showed that it is possible
to decouple completely the recovery of rotational pose parameters from
their translational counterparts. However, unlike Lowe’s, none of these
methods is easily generalizable to deal with uncalibrated focal length
or objects (scenes) with internal degrees of freedom.

*This material is based on work supported by the Luso-American
Foundation, Calouste Gulbenkian Foundation, JNICT, CAPES process
BEX 0591/95-5, NSF IIP grant CDA-94-01142 and DARPA contract
MDA972-92-J-1012.


2  Lowe’s  Algorithm 


Lowe’s  original  algorithm  [9;  8;  7]  addresses  the  issue 
of  viewpoint  and  model  parameter  computation,  given 
a  known  3-D  object  and  the  corresponding  image.  It 
assumes  that  the  imaging  process  is  a  projective  trans¬ 
formation.  The  method  can  thus  be  used  to  identify 
the  pose  (translation  and  orientation  with  respect  to  the 
camera  coordinate  system)  of  a  local  coordinate  system 
affixed  to  an  imaged  rigid  object.  It  can  also  be  extended 
to  discover  the  values  of  other  parameters  such  as  the 
camera  focal  length  and  shape  parameters  of  non-rigid 
objects.  The  recovery  process  is  based  on  the  application 
of  Newton’s  method. 

Rather than solving directly for the vector of parameters $\mathbf{r}$
in the nonlinear system, Newton’s method computes a vector of
corrections $\mathbf{x}$ to be subtracted from the current estimate for
$\mathbf{r}$ on each iteration. If $\mathbf{r}^{(i)}$ is the parameter
vector for iteration $i$, then:

$$\mathbf{r}^{(i+1)} = \mathbf{r}^{(i)} - \mathbf{x}. \qquad (1)$$

Given a vector of error measurements $\mathbf{e}$ between components of
the model and the image, we want to solve for a correction vector
$\mathbf{x}$ that eliminates this error:

$$J\mathbf{x} = \mathbf{e}, \quad \text{where } J_{ij} = \frac{\partial e_i}{\partial x_j}. \qquad (2)$$
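The iteration defined by eqs. (1) and (2) can be sketched generically: evaluate the error and its Jacobian at the current estimate, solve $J\mathbf{x} = \mathbf{e}$ for the correction, and subtract it. The solver name, the least squares solve (which also covers more measurements than parameters) and the two-parameter toy problem are assumptions chosen for illustration; they are not the pose problem itself.

```python
import numpy as np

def newton_solve(error_fn, jacobian_fn, r0, iterations=20, tol=1e-12):
    """Repeat r <- r - x, where x solves J x = e at the current estimate."""
    r = np.asarray(r0, float)
    for _ in range(iterations):
        e = error_fn(r)                      # error vector at the current estimate
        J = jacobian_fn(r)                   # J[i, j] = d e_i / d r_j
        x = np.linalg.lstsq(J, e, rcond=None)[0]
        r = r - x
        if np.linalg.norm(x) < tol:
            break
    return r

# Toy nonlinear system (hypothetical): both errors vanish at r = (1, 2).
def toy_error(r):
    return np.array([r[0] ** 2 + r[1] - 3.0, r[0] - r[1] + 1.0])

def toy_jacobian(r):
    return np.array([[2.0 * r[0], 1.0],
                     [1.0, -1.0]])

r_hat = newton_solve(toy_error, toy_jacobian, [2.0, 1.0])
```

In the tracking setting the error vector would hold the image-space differences between projected model points and their measured locations, and the Jacobian would hold the partial derivatives of $u$ and $v$ with respect to the pose parameters.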


The  equations  used  to  describe  the  projection  of  a 
three-dimensional  model  point  p  into  a  two-dimensional 
image  point  (u,  v)  are: 


$$(x, y, z)^T = R(p - t), \qquad (u, v) = \left( \frac{f x}{z}, \frac{f y}{z} \right), \tag{3}$$

where  T  denotes  transpose,  t  is  a  3-D  translation  vec¬ 
tor  (defined  in  the  model  coordinate  frame)  and  R  is  a 
rotation  matrix  that  transforms  p  in  the  original  model 
coordinates  into  a  point  (x,  y,  z)  in  camera-centered  co¬ 
ordinates.  These  are  combined  in  the  second  equation 
with the focal length f to perform perspective projection
into an image point (u, v).
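As an illustration of Eq. (3), the projection can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function names `mat_vec` and `project` are hypothetical, R is taken to be a 3x3 nested list, and no check is made that the point lies in front of the camera (z > 0).

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def project(R, t, p, f):
    """Project model point p into the image following Eq. (3).

    t is the translation expressed in the model frame, R the rotation
    from model to camera coordinates, f the focal length.
    """
    x, y, z = mat_vec(R, [p[k] - t[k] for k in range(3)])
    return (f * x / z, f * y / z)


# Example: identity rotation, camera 10 units behind the model origin.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project(identity, [0.0, 0.0, -10.0], [1.0, 2.0, 0.0], 1.0)
```

With the values above the camera-centered point is (1, 2, 10), so the image point is (0.1, 0.2) for unit focal length.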

The problem is to solve for t, R and possibly f, given
a number of model points and their corresponding locations
in an image. In order to apply Newton's method,
we must be able to calculate the partial derivatives of u
and v with respect to each of the unknown parameters.
Lowe [8] proposes a reparameterization of the projection
equations, to simplify the calculation by “express[ing]
the translations in terms of the camera coordinate system
rather than model coordinates”:


$$(x', y', z')^T = Rp, \qquad (u, v) = \left( \frac{f x'}{z' + D_z} + D_x, \; \frac{f y'}{z' + D_z} + D_y \right). \tag{4}$$


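The variables R and f remain the same as in the previous
transform, but vector t has been replaced by the
parameters $D_x$, $D_y$ and $D_z$. The two transforms are
equivalent when:

$$t = R^{-1} \begin{bmatrix} -\dfrac{D_x (z' + D_z)}{f} \\[2mm] -\dfrac{D_y (z' + D_z)}{f} \\[2mm] -D_z \end{bmatrix}. \tag{5}$$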


According to Lowe, “in the new parameterization, $D_x$
and $D_y$ simply specify the location of the object on the
image plane and $D_z$ specifies the distance of the object
from the camera”. To compute the partial derivatives of
the error with respect to the rotation angles ($\phi_x$, $\phi_y$ and
$\phi_z$ are the rotation angles about x, y and z, respectively),
it is necessary to calculate the partial derivatives of x, y
and z with respect to these angles. Table 1 gives these
derivatives for all combinations of variables.


              x        y        z
$\phi_x$      0       -z'       y'
$\phi_y$      z'       0       -x'
$\phi_z$     -y'       x'       0

Table 1: The partial derivatives of x, y and z with respect
to counterclockwise rotations $\phi$ (in radians) about
the coordinate axes.
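The entries of Table 1 can be checked numerically. The following sketch (function names `rotate`, `table1_row` and `numeric_row` are hypothetical, and the step size h is an arbitrary choice) compares each row of the table with a central finite difference of an explicit rotation about the corresponding axis.

```python
import math


def rotate(axis, angle, p):
    """Rotate point p counterclockwise by `angle` radians about a coordinate axis."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    if axis == "x":
        return (x, c * y - s * z, s * y + c * z)
    if axis == "y":
        return (c * x + s * z, y, -s * x + c * z)
    return (c * x - s * y, s * x + c * y, z)


def table1_row(axis, p):
    """Analytic derivatives of (x, y, z) with respect to a rotation about `axis`."""
    x, y, z = p
    return {"x": (0.0, -z, y), "y": (z, 0.0, -x), "z": (-y, x, 0.0)}[axis]


def numeric_row(axis, p, h=1e-6):
    """Central finite-difference estimate of the same derivatives at angle 0."""
    plus, minus = rotate(axis, h, p), rotate(axis, -h, p)
    return tuple((a - b) / (2.0 * h) for a, b in zip(plus, minus))


point = (0.3, -1.2, 2.5)
max_gap = max(
    abs(a - n)
    for axis in ("x", "y", "z")
    for a, n in zip(table1_row(axis, point), numeric_row(axis, point))
)
```

For a well-scaled point, `max_gap` should be far below the size of the derivatives themselves, which confirms the signs in the table.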

Newton's method is carried out by calculating the optimum
correction rotations $\Delta\phi_x$, $\Delta\phi_y$ and $\Delta\phi_z$ to be
made about the camera-centered axes. Given Lowe's
parameterization, the partial derivatives of u and v with
respect to each of the seven parameters of the imaging
model (including the focal length f) are given by Table
2.


              u                        v
$D_x$         1                        0
$D_y$         0                        1
$D_z$         $-f c^2 x'$              $-f c^2 y'$
$\phi_x$      $-f c^2 x' y'$           $-f c (z' + c y'^2)$
$\phi_y$      $f c (z' + c x'^2)$      $f c^2 x' y'$
$\phi_z$      $-f c y'$                $f c x'$
$f$           $c x'$                   $c y'$

Table 2: The partial derivatives of u and v with respect
to each of the camera viewpoint parameters and the focal
length, according to Lowe's original approximation.
Here $c = \frac{1}{z' + D_z}$.

Lowe then notes that each iteration of the multi-dimensional
Newton's method solves for a vector of corrections

$$x = [\Delta D_x, \Delta D_y, \Delta D_z, \Delta\phi_x, \Delta\phi_y, \Delta\phi_z]^T. \tag{6}$$

Lowe’s  algorithm  dictates  that  for  each  point  in  the 
model  matched  against  some  corresponding  point  in  the 
image,  we  first  project  the  model  point  into  the  image 
using  the  current  parameter  estimates  and  then  measure 
the  error  in  the  resulting  position  with  respect  to  the 
given  image  point.  The  u  and  v  components  of  the  error 
can  be  used  independently  to  create  separate  linearized 
constraints.  Making  use  of  the  u  component  of  the  error, 
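$E_u$, we create an equation that expresses this error as the
sum of the products of its partial derivatives times the
unknown error-correcting values:

$$\frac{\partial u}{\partial D_x}\Delta D_x + \frac{\partial u}{\partial D_y}\Delta D_y + \frac{\partial u}{\partial D_z}\Delta D_z + \frac{\partial u}{\partial \phi_x}\Delta\phi_x + \frac{\partial u}{\partial \phi_y}\Delta\phi_y + \frac{\partial u}{\partial \phi_z}\Delta\phi_z = E_u. \tag{7}$$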


The  same  point  yields  a  similar  equation  for  its  v 
component.  Thus  each  point  correspondence  yields  two 
equations.  As  Lowe  says:  “from  three  point  corre¬ 
spondences  we  can  derive  six  equations  and  produce  a 
complete  linear  system  which  can  be  solved  for  all  six 
camera-model  corrections” . 


3  Lowe’s  Approximation 


Lowe's formulation assumes that $D_x$ and $D_y$ are constants
to be determined by the iterative procedure, when
in fact they are not constants at all: they depend on
the location of the points being imaged.

Let  the  translation  vector,  the  rotation  matrix  and  the 
description  of  an  arbitrary  feature  in  the  object  frame  be 
denoted,  respectively,  by: 


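$$t = [t_x, t_y, t_z]^T, \qquad R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}, \qquad p = [p_1, p_2, p_3]^T.$$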

Then using the projective transformation formulated in
Eq. (3) the new parameters $D_x$, $D_y$, $D_z$ are given by:

$$D_z = -(r_{31} t_x + r_{32} t_y + r_{33} t_z), \text{ and then:}$$

$$D_x = -f \, \frac{r_{11} t_x + r_{12} t_y + r_{13} t_z}{(r_{31} p_1 + r_{32} p_2 + r_{33} p_3) + D_z},$$

$$D_y = -f \, \frac{r_{21} t_x + r_{22} t_y + r_{23} t_z}{(r_{31} p_1 + r_{32} p_2 + r_{33} p_3) + D_z}. \tag{8}$$


$D_z$ is dependent only on the object pose parameters,
but $D_x$ and $D_y$ are also a function of each point's coordinates
in the object coordinate frame. It is therefore
in general impossible to find a single consistent value
either for $D_x$ or for $D_y$. In the general case both these
parameters will depend on the position of each individual
object feature. They are not constants: they are only
the same for those points for which $r_{31} p_1 + r_{32} p_2 + r_{33} p_3$
has the same value. Therefore we cannot use $D_x$ and $D_y$
as defined in Eq. (4). The assumption that is implicit
in Lowe's algorithm as published is that the corrections
needed for the translation are much larger than those
due to rotation of the object. However, if no restrictions
are imposed, the coordinates of the points in the object
coordinate frame p can assume high values. Even if they
do not, the term $r_{31} p_1 + r_{32} p_2 + r_{33} p_3$ may change significantly
(due to the object's own geometry) and affect
the estimation process.
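The dependence of $D_x$ and $D_y$ on the individual point can be made concrete with a small numerical sketch of Eq. (8). The function name `lowe_parameters` and the sample pose below are assumptions chosen only for illustration.

```python
def lowe_parameters(R, t, p, f):
    """Evaluate D_x, D_y, D_z of Eq. (8) for a single model point p."""
    Rt = [sum(R[i][j] * t[j] for j in range(3)) for i in range(3)]
    # Depth of the rotated point: r31*p1 + r32*p2 + r33*p3.
    z_prime = sum(R[2][j] * p[j] for j in range(3))
    Dz = -Rt[2]
    Dx = -f * Rt[0] / (z_prime + Dz)
    Dy = -f * Rt[1] / (z_prime + Dz)
    return Dx, Dy, Dz


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 2.0, -10.0]

near = lowe_parameters(identity, t, [0.0, 0.0, 0.0], 1.0)
far = lowe_parameters(identity, t, [0.0, 0.0, 10.0], 1.0)
```

Here `near` and `far` share the same $D_z$ (10), but $D_x$ changes from -0.1 to -0.05 because the two points lie at different depths along the optical axis, so no single value of $D_x$ fits both.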

Ishii's formulation [6] contains different simplifications.
See [2] for details.


4  Full  Projective  Solution 


Initially  define  x' ,  y'  and  z'  as  in  Lowe’s  formulation: 


$$(x', y', z')^T = Rp.$$


Model  the  image  formation  process  by  Eq.  (3).  Remove 
the  approximations  of  Lowe  and  Ishii  by  defining: 

$$D'_x = -(r_{11} t_x + r_{12} t_y + r_{13} t_z),$$
$$D'_y = -(r_{21} t_x + r_{22} t_y + r_{23} t_z), \tag{9}$$
$$D'_z = -(r_{31} t_x + r_{32} t_y + r_{33} t_z).$$


In  this  case  the  image  coordinates  of  each  point  are  given 
by: 


$$(u, v) = \left( f \, \frac{x' + D'_x}{z' + D'_z}, \; f \, \frac{y' + D'_y}{z' + D'_z} \right). \tag{10}$$


The  partial  derivatives  of  u  and  v  with  respect  to  each 
of  the  six  pose  parameters  and  the  focal  length  are  given 
by  Table  3. 


              u                        v
$D'_x$        $f c$                    0
$D'_y$        0                        $f c$
$D'_z$        $-f a c^2$               $-f b c^2$
$\phi_x$      $-f a c^2 y'$            $-f c (z' + b c y')$
$\phi_y$      $f c (z' + a c x')$      $f b c^2 x'$
$\phi_z$      $-f c y'$                $f c x'$
$f$           $a c$                    $b c$

Table 3: The partial derivatives of u and v with respect
to each of the camera viewpoint parameters and the focal
length according to our full projective solution. Here
$(a, b, c) = (x' + D'_x, \; y' + D'_y, \; \frac{1}{z' + D'_z})$.

As in Lowe's formulation, the translation vector is
computed using Eq. (5), with $D'_x$, $D'_y$ and $D'_z$ as defined
in Eq. (9). This translation vector is defined in
the object coordinate frame. The minimization process
yields estimates of $D'_x$, $D'_y$ and $D'_z$, which are the result
of the product of the rotation matrix by the translation
vector.

A  numerically  equivalent  but  conceptually  more  ele¬ 
gant  way  of  looking  at  this  solution  is  through  a  redef¬ 
inition  of  the  image  formation  process,  so  that  rotation 
and  translation  are  explicitly  decoupled,  and  the  trans¬ 
lation  vector  is  defined  in  the  camera  coordinate  frame. 
Redefine: 

$$(x, y, z)^T = Rp + t, \tag{11}$$

then: $[D'_x, D'_y, D'_z]^T = t$,

and Eqs. (9) and (10) can be collapsed into:

$$(u, v) = \left( \frac{f x}{z}, \frac{f y}{z} \right). \tag{12}$$


In  this  case,  the  least-squares  minimization  procedure 
gives  the  estimates  of  the  translation  vector  directly. 
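To make the procedure concrete, here is a minimal Python sketch of one Newton (Gauss-Newton) step in the decoupled form $(x, y, z)^T = Rp + t$, using the derivatives of the full projective solution with $(a, b, c) = (x, y, 1/z)$. It is not the authors' Matlab code. The helper names, the use of normal equations with Gaussian elimination, the fixed focal length, and the order in which the three correction rotations are composed are all assumptions made for illustration.

```python
import math


def mat_vec(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rot_xyz(ax, ay, az):
    """Rotation Rz(az) * Ry(ay) * Rx(ax) about the fixed camera axes."""
    cx, sx = math.cos(ax), math.sin(ax)
    cy, sy = math.cos(ay), math.sin(ay)
    cz, sz = math.cos(az), math.sin(az)
    Rx = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    Ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    Rz = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]
    return mat_mul(Rz, mat_mul(Ry, Rx))


def solve(A, b):
    """Gaussian elimination with partial pivoting for a square system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            k = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= k * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def project(R, t, p, f):
    """Image point of model point p, with t in the camera frame."""
    xp, yp, zp = mat_vec(R, p)
    return (f * (xp + t[0]) / (zp + t[2]), f * (yp + t[1]) / (zp + t[2]))


def newton_step(R, t, f, model, image):
    """One correction of (t, R) from matched model and image points."""
    rows, errs = [], []
    for p, (u_obs, v_obs) in zip(model, image):
        xp, yp, zp = mat_vec(R, p)
        a, b, c = xp + t[0], yp + t[1], 1.0 / (zp + t[2])
        # Partial derivatives with respect to (t_x, t_y, t_z, phi_x, phi_y, phi_z).
        rows.append([f * c, 0.0, -f * a * c * c,
                     -f * a * c * c * yp, f * c * (zp + a * c * xp), -f * c * yp])
        rows.append([0.0, f * c, -f * b * c * c,
                     -f * c * (zp + b * c * yp), f * b * c * c * xp, f * c * xp])
        errs += [f * a * c - u_obs, f * b * c - v_obs]
    # Least-squares solution of J x = e through the normal equations.
    JtJ = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    Jte = [sum(r[i] * e for r, e in zip(rows, errs)) for i in range(6)]
    x = solve(JtJ, Jte)
    # Subtract the corrections, as in Eq. (1).
    t_new = [t[k] - x[k] for k in range(3)]
    R_new = mat_mul(rot_xyz(-x[3], -x[4], -x[5]), R)
    return R_new, t_new
```

A typical use is to call `newton_step` repeatedly from an initial guess `(R, t)` until the reprojection error stops decreasing. Because errors are defined as predicted minus observed coordinates, the corrections are subtracted from the current estimate.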


5  Experimental  Results 

We  compare  the  three  algorithms  described  in  the  pre¬ 
vious  sections  with  extensive  experiments  on  synthetic 
data.¹ This paper is an abbreviated version of [2] and
[1],  which  can  be  consulted  for  more  details  and  many 
more  results. 

Our  experiments  take  the  imaged  object  to  be  the 
eight  corners  of  a  cube,  with  edge  lengths  equal  to  25 
times  the  focal  length  of  the  camera  (for  a  20  mm  lens, 
for  instance,  this  corresponds  to  a  half  meter  wide,  long 
and deep object). The parameters explicitly controlled,
in general, were the depth of the object's center with
respect to the camera frame ($z_{true}$), measured in focal
lengths, and the magnitudes of the translation ($t_{diff}$) and
the rotation ($r_{diff}$) needed to align the initial solution
with the true pose. We tested near, intermediate and far
viewing situations. The other nine pose and initial solution
parameters are in general sampled uniformly over
their whole domain.

For each test we compute two global image-space error
measures, assuming known correspondence between
image  and  model  features.  The  first,  called  Norm  of 
Distances  Error  (NDE),  is  the  norm  of  the  vector  of 
distances  between  the  positions  of  the  features  in  the 
actual  image  and  the  positions  of  the  same  features  in 
the  reprojected  image  generated  by  the  estimated  pose. 
The  second,  called  Maximum  Distance  Error  (MDE), 
is  the  greatest  absolute  value  of  the  vector  of  error  dis¬ 
tances.  Both  measures  are  always  expressed  using  the 
focal  length  as  length  unit. 
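The two image-space measures can be written down directly. In this sketch the function names are hypothetical and "norm" is assumed to mean the Euclidean norm; image points are (u, v) pairs expressed in focal lengths.

```python
import math


def distance_errors(actual, reprojected):
    """Per-feature distances between actual and reprojected image points."""
    return [math.hypot(ua - ur, va - vr)
            for (ua, va), (ur, vr) in zip(actual, reprojected)]


def nde(actual, reprojected):
    """Norm of Distances Error: Euclidean norm of the distance vector."""
    return math.sqrt(sum(d * d for d in distance_errors(actual, reprojected)))


def mde(actual, reprojected):
    """Maximum Distance Error: largest entry of the distance vector."""
    return max(distance_errors(actual, reprojected))


actual = [(0.0, 0.0), (0.0, 0.0)]
reprojected = [(3.0, 4.0), (0.0, 12.0)]
```

For the two sample features the distances are 5 and 12, so the NDE is 13 and the MDE is 12.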

NDE  and  MDE  do  not  necessarily  indicate  how  close 
the  estimated  pose  is  from  the  true  pose.  We  also  record 
individual  errors  for  six  different  pose  parameters:  the 
errors  in  the  x,  y  and  z  coordinates  of  the  estimate  for 
the  actual  object  translation  vector,  measured  as  relative 
errors with respect to the object's center actual depth
($z_{true}$), and the absolute errors in the estimates for the
roll, pitch and yaw angles of the object frame with respect
to the camera, measured in units of $\pi$ radians.

For  each  of  the  eight  different  error  measures,  we 
compute  the  average,  standard  deviation,  and  median. 
Statistics  that  leave  out  the  tails  of  the  error  distribu¬ 
tions  are  included  to  be  fair  to  a  method  (if  any)  that  un¬ 
derperforms  in  a  few  exceptional  situations  but  is  better 
“in general”. For more error measures and more statistics
see [2; 1].

5.1  Convergence  in  the  General  Case 

Here we compare the speed of convergence and final accuracy
of each method with arbitrary poses and initial
conditions. The statistics for the NDE, based on 13,500
executions per method, are plotted in Fig. 1. They show
that for most poses Lowe's original approximation converges
to a very high global error level and Ishii's approximation
only improves the initial solutions in its first
iteration and diverges after that. The full projective solution,
on the other hand, converges at a superexponential
rate to an error level roughly equivalent to the relative
rounding error of double precision, which is about
$1.11 \times 10^{-16}$.

¹ Matlab code implementing the three algorithms is available
from the authors.

Even  taking  into  account  the  worst  data,  our  approx¬ 
imation  still  converges  superexponentially  to  this  maxi¬ 
mum  precision  level  —  the  bad  cases  only  slow  conver¬ 
gence  a  bit.  In  this  case  Lowe’s  original  algorithm  and 
(especially)  Ishii’s  approximation  tend  to  diverge,  yield¬ 
ing  some  solutions  worse  than  the  initial  conditions. 

The statistics for the errors in the individual pose parameters
make the superiority of the full projective approach
even more clear. Fig. 2 exhibits the relative
errors in the value of the x translation. Both Lowe's and
Ishii's algorithms diverge in most situations, while the
full projective solution keeps its superexponential convergence.
Due to their simplifications, Lowe's and Ishii's
methods in those cases are not able to recover the true
rotation  of  the  object.  They  tend  to  make  corrections  in 
the  translation  components  to  fit  the  erroneously  rotated 
models  to  the  image  in  a  least-squares  sense,  generat¬ 
ing  very  imprecise  values  for  the  parameters  themselves. 
Ishii’s  approximation  tends  to  translate  the  object  as  far 
away  from  the  camera  as  possible,  so  that  the  repro¬ 
jected  images  for  all  points  are  collapsed  into  a  single 
spot  that  minimizes  the  mean  of  the  squared  distances 
with  respect  to  the  true  images.  Similar  results  were 
obtained  for  the  other  five  parameter-space  errors. 

5.2  Other  Conditions 

We  ran  many  tests  only  reported  in  brief  prose  here  (see 
[2]  for  details).  First  we  assume  that  the  pose  is  ap¬ 
proximately  known:  In  this  case  the  accuracy  of  Ishii’s 
approximation  is  much  improved  (predictably,  given  its 
semantics). Instead of diverging, now it converges exponentially
towards the rounding error lower bound. It
is still dominated by the full projective solution, which
converges super-exponentially (in about 5 iterations) for
the NDE (and also for all other error metrics tested).

Timing  tests  with  optimized  Matlab  code  show  that 
the  full  projective  solution  (which  has  only  four  floating 
point  operations  more  than  Lowe’s  method  in  the  inner 
loop)  takes  2.99%  to  4.21%  longer  than  Lowe’s  method. 
However,  the  standard  deviations  of  the  execution  times 
for  Lowe’s  solution  are  between  6%  and  130%  bigger  than 
those  of  the  full  projective.  Thus,  the  full  projective 
approach  may  be  more  suitable  for  hard  real  time  con¬ 
straints,  due  to  its  smaller  sensitivity  to  ill-conditioned 
configurations. 

We tested the sensitivity of the techniques to individual
variations in the depth of the object along the optical
axis and in the magnitudes of the translational and rotational
errors in the initial solution. We tested the effects
of Gaussian noise with zero mean and controlled standard
deviation added to the coordinates of the image
features. The microstructure of all these results is interesting
and explicable [2] but the message is the same:
the full projective solution performs significantly, qualitatively
better for all cases when any solution is usable.

5.3  Accuracy  in  Practice 

Would  the  projective  method  perform  better  in  a  practi¬ 
cal  situation?  The  noise  experiments  were  relevant,  but 
in  a  real  system  a  more  important  issue  is  establishing 
the  initial  conditions.  One  could  use  a  smoothing  fil¬ 
ter,  but  this  approach  is  very  dependent  on  application- 
specific  parameters,  such  as  the  sampling  rate  of  the 
camera,  the  bandwidth  of  the  image  processing  system 
as  a  whole,  the  positional  depth,  the  linear  speed  and 
the  angular  speed  of  the  tracked  object. 

We  follow  a  more  general  approach:  use  a  weaker  cam¬ 
era  model  to  generate  an  initial  solution  for  the  problem 
analytically,  and  then  use  the  projective  iterative  solu- 
tion(s)  to  refine  this  initial  estimate.  This  approach  was 
suggested  by  DeMenthon  and  Davis  [4],  who  introduced 
a  way  of  describing  the  discrepancy  between  a  weak  per¬ 
spective  solution  and  the  full  perspective  pose  with  a  set 
of parameters that can then be refined numerically, yielding
the latter from the former. Let $p_i$ be the description
of the i-th model point in the model frame and $[u_i, v_i]$
be the corresponding image, $1 \le i \le n$. Then, the weak
perspective solution proposed in that paper amounts to
solving the following set of equations (in a least-squares
sense), for the unknown three-dimensional vectors I and J:

$$(p_i - p_0) \cdot I = u_i - u_0, \quad 1 \le i \le n$$
$$(p_i - p_0) \cdot J = v_i - v_0, \quad 1 \le i \le n \tag{13}$$

A  normalization  of  these  vectors  yields  the  two  first 
rows  of  the  rotation  component  of  the  transformation 
that  describes  the  object  frame  in  the  camera  coordinate 
system.  The  third  row  can  then  be  obtained  with  a  single 
cross  product  operation.  After  that,  the  recovery  of  the 
translation  is  straightforward. 
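The rotation part of this weak perspective initialization can be sketched as follows. The sketch assumes that the reference point $p_0$ is the first model point in the list, that the model points are not coplanar, and that the least-squares problems are solved through 3x3 normal equations with Cramer's rule. These choices, and all function names, are illustrative and not taken from [4].

```python
import math


def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


def solve3(A, b):
    """Cramer's rule for a 3x3 system A x = b."""
    d = det3(A)
    out = []
    for k in range(3):
        Ak = [[b[i] if j == k else A[i][j] for j in range(3)] for i in range(3)]
        out.append(det3(Ak) / d)
    return out


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def weak_perspective_rotation(model, image):
    """Estimate the rotation rows from the weak perspective equations."""
    p0, (u0, v0) = model[0], image[0]
    rows = [[p[k] - p0[k] for k in range(3)] for p in model[1:]]
    du = [u - u0 for u, _ in image[1:]]
    dv = [v - v0 for _, v in image[1:]]
    # Normal equations A^T A I = A^T du and A^T A J = A^T dv.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    I = solve3(AtA, [sum(r[i] * d for r, d in zip(rows, du)) for i in range(3)])
    J = solve3(AtA, [sum(r[i] * d for r, d in zip(rows, dv)) for i in range(3)])
    i_hat, j_hat = normalize(I), normalize(J)
    # Third row from a single cross product.
    k_hat = [i_hat[1] * j_hat[2] - i_hat[2] * j_hat[1],
             i_hat[2] * j_hat[0] - i_hat[0] * j_hat[2],
             i_hat[0] * j_hat[1] - i_hat[1] * j_hat[0]]
    return [i_hat, j_hat, k_hat]
```

With noisy or truly perspective data the two recovered rows are only approximately orthogonal, which is one reason the result is used as an initial solution rather than a final pose.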

However,  this  simple  weak  perspective  approximation 
introduces  errors  that  increase  proportionally  not  only 
with  the  inverse  depth  of  the  object,  but  also  with  its 
“off-axis”  angle.  In  order  to  avoid  this  problem,  we  first 
preprocessed  the  image  to  simulate  a  rotation  that  puts 
the  center  of  the  object’s  image  in  the  intersection  of 
the  optical  axis  with  the  image  plane  (this  seems  to  be 
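a novel wrinkle in this context). Let the center of the
object image be described by $[u, v]$. Then, this transformation,
as suggested in [12], is given by:

$$R = \begin{bmatrix} \dfrac{1}{d_1} & 0 & -\dfrac{u}{d_1} \\[2mm] -\dfrac{u v}{d_1 d_2} & \dfrac{d_1}{d_2} & -\dfrac{v}{d_1 d_2} \\[2mm] \dfrac{u}{d_2} & \dfrac{v}{d_2} & \dfrac{1}{d_2} \end{bmatrix}, \quad \text{where: } d_1 = \sqrt{u^2 + 1}, \quad d_2 = \sqrt{u^2 + v^2 + 1}. \tag{14}$$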
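A short sketch of this "foveating" rotation follows. It assumes image coordinates measured in focal lengths (so the viewing direction of the object center is (u, v, 1)) and uses the standard rotation that carries that direction onto the optical axis. The function name is hypothetical.

```python
import math


def foveating_rotation(u, v):
    """Rotation that maps the viewing direction (u, v, 1) onto the optical axis."""
    d1 = math.sqrt(u * u + 1.0)
    d2 = math.sqrt(u * u + v * v + 1.0)
    return [[1.0 / d1, 0.0, -u / d1],
            [-u * v / (d1 * d2), d1 / d2, -v / (d1 * d2)],
            [u / d2, v / d2, 1.0 / d2]]


R = foveating_rotation(0.3, -0.4)
# Image of the object center after the simulated rotation.
centered = [sum(R[i][j] * c for j, c in enumerate((0.3, -0.4, 1.0)))
            for i in range(3)]
```

After the rotation the first two components of `centered` vanish (up to rounding), so the object center projects onto the intersection of the optical axis with the image plane.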


After this preprocessing, we applied the technique described
by Eq. (13), in order to recover the “foveated”
pose. Then, we premultiplied the resulting transformation
by the inverse of the matrix defined in Eq. (14),
in order to recover the original weak perspective pose,
which was used as the initial solution for the iterative
techniques being compared.

Figure 1: Convergence of an image-space error metric, the Norm of Distances Error (see introduction of Section 5),
with respect to the number of iterations of Lowe's (solid line), Ishii's (dotted line), and our full projective solution
(dash-dotted line). Tests performed with a cube, rotated by arbitrary angles with respect to the camera frame.

Figure 2: Convergence of the ratio between the error on the estimated x translation and the actual depth of the
object's center, with respect to the number of iterations of Lowe's (solid line), Ishii's (dotted line), and our full
projective solution (dash-dotted line). Tests performed with a cube, rotated by arbitrary angles with respect to the
camera frame.

The only controlled parameter left was the actual
depth of the object's center ($z_{true}$). We chose nine average
values for it, growing exponentially from 25 to 6,400
focal lengths. The noise standard deviation was set at
0.002 focal lengths (corresponding roughly to a 512 ×
512 spatial quantization). The number of iterations of
each method per run was set at 2, allowing a real time
execution rate of about 100 Hz. For each average value
of $z_{true}$, 2,500 independent runs of each technique were
performed.

The  statistics  for  the  NDE,  depicted  in  Fig.  3,  show 
that  our  full  projective  solution  was  up  to  one  order  of 
magnitude  more  accurate  than  the  other  two  methods 
for  most  cases  in  which  the  distance  was  smaller  than 
1,000  focal  lengths  (about  20  m,  with  the  typical  focal 
length  of  20  mm).  For  distances  bigger  than  that,  the 
precision  of  the  weak  perspective  initial  solution  alone 
was  bigger  than  the  limitation  imposed  by  the  noise  and 
so  the  three  techniques  performed  equally  well. 

For x translation error (Fig. 4) and the other five
parameter-space  errors,  we  found  that  all  the  techniques 
exhibit  parameter-space  accuracy  peaks  in  the  range  of 
50  to  400  focal  lengths.  When  the  object  is  too  close, 
the  quality  of  the  initial  weak  perspective  solution  de¬ 
grades  quickly.  When  the  object  is  too  far  away,  the 
noise  gradually  overpowers  the  information  about  both 
the  distance  (via  observed  size)  and  the  orientation  of 
the  object,  since  all  the  feature  images  tend  to  collapse 
into  a  single  point. 

Similar results were obtained when the number of iterations
for each run was raised to 5. This suggests that
our solution may be very well suited for indoor applications
in which it is possible to keep a safe distance
between the objects of interest and the camera.

6  Discussion  and  Conclusion 

This  note  formulates  a  full  projective  treatment  of  a 
pose-  or  parameter-recovery  algorithm  initially  pro¬ 
posed  by  Lowe  [7;  8;  9].  The  projective  formulation  is 
compared  with  formulations  by  Lowe  and  Ishii  [6]  that 
approximate  the  full  projective  case.  Many  experiments 
based  on  different  scenaria  are  presented  here,  and  more 
are  available  in  [2].  Our  experiments  indicate  that  a 
straightforward  reformulation  of  the  perspective  imaging 
equations  removes  mathematical  approximations  that 
limit the precision of Lowe's and Ishii's formulations.
The  full  projective  algorithm  has  better  accuracy  with 
a  minimal  increase  in  terms  of  computational  cost  per 
iteration. 

The  full  projective  solution  is  very  stable  for  a  wide 
range  of  actual  object  poses  and  initial  conditions.  In 
some  particularly  extreme  scenaria,  our  approach  does 
suffer  from  numerical  stability  problems,  but  in  these 
situations  the  accuracy  of  Lowe’s  and  Ishii’s  approxima¬ 
tions  is  also  unacceptable,  with  errors  of  one  or  more  or¬ 
ders  of  magnitude  in  the  values  of  the  pose  parameters. 
We  believe  that  this  type  of  problem  is  a  consequence 
of  Newton’s  method  and  can  only  be  overcome  with  the 
use  of  more  powerful  numerical  optimization  techniques, 
such  as  trust  region  methods. 

In scenaria that may realistically arise in applications
such as indoor navigation, with the use of reasonable
(weak perspective) initial solutions and taking into account
the effect of additive Gaussian noise in the imaging
process, the full projective formulation outperforms
both Lowe's and Ishii's approximations by up to an order
of magnitude in terms of accuracy, with practically
the same computational cost.

Figure 3: Sensitivity of an image-space error metric, the Norm of Distances Error (see introduction of Section 5),
with respect to the actual depth of the object's center (in focal lengths), for Lowe's (solid line), Ishii's (dotted
line), and our full projective solution (dash-dotted line). Tests performed with initial solutions generated by a weak
perspective approximation.

Figure 4: Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's
center, with respect to the actual depth of the object's center (in focal lengths), for Lowe's (solid line), Ishii's (dotted
line), and our full projective solution (dash-dotted line). Tests performed with initial solutions generated by a weak
perspective approximation.
Abstract

This  paper  analyzes  the  conditions  when  a  dis¬ 
crete  set  of  images  implicitly  describes  scene  ap¬ 
pearance  for  a  continuous  range  of  viewpoints. 
It  is  shown  that  two  basis  views  of  a  static  scene 
uniquely  determine  the  set  of  all  views  on  the 
line  between  their  optical  centers  when  a  vis¬ 
ibility  constraint  is  satisfied.  Additional  basis 
views  extend  the  range  of  predictable  views  to 
2D  or  3D  regions  of  viewpoints.  A  simple  scan¬ 
line  algorithm  called  view  morphing  is  presented 
for  generating  these  views  from  a  set  of  basis  im¬ 
ages.  The  technique  is  applicable  to  both  cali¬ 
brated  and  uncalibrated  images. 

1  Introduction 

Image-based  representations  of  3D  scenes  are 
currently  being  developed  by  many  researchers 
in  the  computer  vision  and  computer  graph¬ 
ics  communities.  These  representations  encode 
scene  appearance  with  a  set  of  images  that 
are  adaptively  combined  to  produce  new  views 
of  the  scene.  At  the  heart  of  this  area  lies 
a fundamental question: To what extent can
scene  appearance  be  modeled  with  a  sparse 
set  of  images?  Clearly,  the  images  provide 
scene  appearance  at  a  discrete  set  of  view¬ 
points  but  it  has  not  been  clear  if  a  more 
complete  coverage  of  viewspace  is  theoretically 
possible. A number of “view synthesis” techniques
have been developed recently [Chen and
Williams, 1993, Laveau and Faugeras, 1994,
McMillan and Bishop, 1995, Beymer and Poggio,
1996] to extend the range of predictable
views. However, those methods require solving
ill-posed correspondence tasks, suggesting
that the view synthesis problem is inherently
ill-posed.

*The support of the Defense Advanced Research
Projects Agency, and the National Science Foundation
under Grant No. IRI-9530985 is gratefully
acknowledged.

As  a  foundation  for  work  in  this  area  we  feel  it 
is  necessary  to  answer  the  following  two  ques¬ 
tions.  First,  given  two  perspective  views  of  a 
static  scene,  under  what  conditions  can  new 
views  be  predicted?  Second,  which  views  are 
determined  from  a  set  of  basis  images?  In  this 
paper  we  show  that  a  specific  range  of  perspec¬ 
tive  views  is  theoretically  determined  from  two 
or  more  basis  views,  under  a  generic  visibility 
assumption  called  monotonicity.  This  result  ap¬ 
plies  when  either  the  relative  camera  configura¬ 
tions  are  known  or  when  only  the  fundamen¬ 
tal  matrix  is  available.  In  addition,  we  present 
a  simple  technique  for  generating  this  partic¬ 
ular  range  of  views  using  image  interpolation. 
Importantly,  the  method  relies  only  on  measur¬ 
able  image  information,  avoiding  ill-posed  corre¬ 
spondence  problems  entirely.  Furthermore,  all 
processing  occurs  at  the  scanline  level,  effec¬ 
tively  reducing  the  original  3D  synthesis  prob¬ 
lem  to  a  set  of  simple  ID  image  transformations 
that  can  be  implemented  efficiently  on  existing 
graphics  workstations.  The  work  presented  here 
extends  to  perspective  projection  previous  re¬ 
sults  on  the  orthographic  case  [Seitz  and  Dyer, 
1995]. 
Figure 1: The monotonicity constraint holds
when $\theta_0 \theta_1 > 0$ for all pairs of scene
points P and Q in the same epipolar
plane.


We  begin  by  introducing  the  monotonicity  con¬ 
straint  and  describing  its  implications  for  view 
synthesis  in  Section  2.  Section  3  considers  how 
views  can  be  synthesized,  and  describes  a  sim¬ 
ple  and  efficient  algorithm  called  view  morph¬ 
ing  for  synthesizing  new  views  by  interpolating 
images,  under  the  assumption  that  the  relative 
geometry  of  the  two  cameras  is  known.  Sec¬ 
tion  4  investigates  the  case  where  the  images 
are  uncalibrated,  i.e.,  the  camera  geometry  is 
unknown.  Section  5  presents  extensions  when 
three  or  more  basis  views  are  available.  Sec¬ 
tion  6  presents  some  results  on  real  images. 


2  View  Synthesis  and  Monotonicity 

Can  the  appearance  from  new  viewpoints  of 
a  static  three-dimensional  scene  be  predicted 
from  a  set  of  basis  views  of  the  same  scene? 
One way of addressing this question is to consider
view synthesis as a two-step process:
reconstruct the scene from the basis views using
stereo or structure-from-motion methods and
then reproject to form the new view. The problem
with this paradigm is that view synthesis
becomes at least as difficult as 3D scene reconstruction.
This conclusion is especially unfortunate
in light of the fact that 3D reconstruction
from sparse images is generally ambiguous: a
number of different scenes may be consistent
with a given set of images; it is an ill-posed
problem. This suggests that view synthesis is
also ill-posed.


In  this  section  we  present  an  alternate  paradigm 
for  view  synthesis  that  avoids  3D  reconstruction 
and  dense  correspondence  as  intermediate  steps, 
instead  relying  only  on  measurable  quantities, 
computable  from  a  set  of  basis  images.  We  first 
consider  the  conditions  under  which  reconstruc¬ 
tion  is  ill-posed  and  then  describe  why  these 
conditions  do  not  impede  view  synthesis.  Am¬ 
biguity  arises  within  regions  of  uniform  intensity 
in  the  images.  Uniform  image  regions  provide 
shape  and  correspondence  information  only  at 
boundaries.  Consequently,  3D  reconstruction  of 
these  regions  is  not  possible  without  additional 
assumptions.  Note  however  that  boundary  in¬ 
formation  is  sufficient  to  predict  the  appear¬ 
ance  of  these  regions  in  new  views,  since  the 
region’s  interior  is  assumed  to  be  uniform.  This 
argument  hinges  on  the  notion  that  uniform  re¬ 
gions  are  “preserved”  in  different  views,  a  con¬ 
straint  formalized  by  the  condition  of  mono¬ 
tonicity  which  we  introduce  next. 

Consider  two  views,  Vo  with  respective 

optical  centers  Cq  and  Ci,  and  images  Jo  and 
7i.  Denote  CqCi  as  the  line  segment  connect¬ 
ing  the  two  optical  centers.  Any  point  P  in  the 
scene  determines  an  epipolar  plane  containing 
P,  Co,  and  Ci  that  intersects  the  two  images 
in  conjugate  epipolar  lines.  The  monotonicity 
constraint  dictates  that  all  visible  scene  points 
appear  in  the  same  order  along  conjugate  epipo¬ 
lar  lines  of  Jo  and  h .  This  constraint  is  used 
commonly  in  stereo  matching  because  the  fixed 
relative  ordering  of  points  along  epipolar  lines 
simplifies  the  correspondence  problem.  Despite 
its  usual  definition  with  respect  to  epipolar  lines 
and  images,  monotonicity  constrains  only  the 
location  of  the  optical  centers  with  respect  to 
points  in  the  scene — the  image  planes  may  be 
chosen  arbitrarily.  An  alternate  definition  that 
isolates  this  dependence  more  clearly  is  shown 
in Figure 1. Any two scene points $P$ and $Q$
in the same epipolar plane determine angles $\theta_0$
and $\theta_1$ with the optical centers $C_0$ and $C_1$.
The monotonicity constraint dictates that for
all such points $\theta_0$ and $\theta_1$ must be nonzero and
of equal sign. The fact that no constraint is
made  on  the  image  planes  is  of  primary  impor¬ 
tance  for  view  synthesis  because  it  means  that 
monotonicity is preserved under homographies,
i.e., under image reprojection. This fact will be
essential in the next section for developing an
algorithm for view synthesis.

Figure 2: Although the projected intervals in
$I_0$ and $I_1$ do not provide enough information
to reconstruct $S_1$, $S_2$, and $S_3$, they are
sufficient to predict the appearance of $I_s$.
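
The ordering form of the monotonicity constraint can be illustrated with a small check on matched points. The sketch below is not from the paper: the function name `preserves_order` and its inputs are hypothetical, and it assumes that both conjugate epipolar lines are parameterized in the same direction.

```python
import numpy as np

def preserves_order(x0, x1):
    """Return True if matched points keep the same order on both lines.

    x0[i] and x1[i] are positions of the same scene point along a pair of
    conjugate epipolar lines (hypothetical inputs for this illustration).
    """
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    order = np.argsort(x0, kind="stable")        # sort matches by position in image 0
    return bool(np.all(np.diff(x1[order]) > 0))  # positions in image 1 must also increase

consistent = preserves_order([1.0, 2.0, 5.0], [0.5, 3.0, 4.0])
violated = preserves_order([1.0, 2.0, 5.0], [3.0, 0.5, 4.0])
```

A set of matches that fails this check cannot come from a scene that satisfies monotonicity for the two views; passing it only shows consistency with such a scene.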

A useful consequence of monotonicity is that it
extends to cover a continuous range of views in-between
$V_0$ and $V_1$. We say that a third view $V_s$
is in-between $V_0$ and $V_1$ if its optical center $C_s$ is
on $\overline{C_0C_1}$. Observe that monotonicity is violated
only when there exist two scene points, $P$ and
$Q$, in the same epipolar plane such that the infinite
line $PQ$ through $P$ and $Q$ intersects $\overline{C_0C_1}$.
But $PQ$ intersects $\overline{C_0C_1}$ if and only if it intersects
either $\overline{C_0C_s}$ or $\overline{C_sC_1}$. Therefore monotonicity
applies to in-between views as well, i.e.,
signs of angles are preserved and visible scene
points appear in the same order along conjugate
epipolar lines of all views along $\overline{C_0C_1}$. We
therefore refer to the range of views with centers
on $\overline{C_0C_1}$ as a monotonic range of viewspace.
Notice that this range gives a lower bound
on the range of views for which monotonicity is
satisfied in the sense that the latter set contains
the former. For instance, in Figure 1 monotonicity
is satisfied for all views on the open ray
from the point $C_0C_1 \cap PQ$ through both camera
centers. However, without a priori knowledge
of the geometry of the scene, we can infer
only that monotonicity is satisfied for the range
$\overline{C_0C_1}$.

The property that monotonicity applies to in-between
views is quite powerful and is sufficient
to completely predict the appearance of the visible
scene from all viewpoints along $\overline{C_0C_1}$. Consider
the projections of a set of uniform Lambertian
surfaces (each surface has uniform radiance,
but any two surfaces can have different
radiances) into views $V_0$ and $V_1$. Figure 2 shows
cross sections $S_1$, $S_2$, and $S_3$ of three such surfaces
projecting into conjugate epipolar lines $l_0$
and $l_1$. Each connected cross section projects to
a uniform interval (i.e., an interval of uniform
intensity) of $l_0$ and $l_1$. The monotonicity constraint
induces a correspondence between the
endpoints of the intervals in $l_0$ and $l_1$, determined
by their relative ordering. The points on
$S_1$, $S_2$, and $S_3$ projecting to the interval endpoints
are determined from this correspondence
by triangulation. We will refer to these scene
points as visible endpoints of $S_1$, $S_2$, and $S_3$.

Now consider an in-between view, $V_s$, with image
$I_s$ and corresponding epipolar line $l_s$. As
a consequence of monotonicity, $S_1$, $S_2$, and $S_3$
project to three uniform intervals along $l_s$, delimited
by the projections of their visible endpoints.
Notice that the intermediate image
does not depend on the specific shapes of surfaces
in the scene, only on the positions of
their visible endpoints. Any number of distinct
scenes could have produced $I_0$ and
$I_1$, but each one would also produce the
same set of intermediate images. Hence,
all views along $\overline{C_0C_1}$ are determined from $I_0$
and $I_1$. This result demonstrates that view
synthesis under monotonicity is an inherently
well-posed problem, and is therefore much easier
than 3D reconstruction and related motion
analysis tasks requiring smoothness conditions
and regularization techniques.

A final question concerns the measurability of
monotonicity. That is, can we determine if two
images satisfy monotonicity by inspecting the
images themselves or must we know the answer
a priori? Strictly speaking, monotonicity
is not measurable in the sense that two images
may be consistent with multiple scenes, some
of which satisfy monotonicity and others that
do not. However, we can determine whether or
not two images are consistent with a scene for
which monotonicity applies, by checking that
each epipolar line in the first image is a monotonic
warp of its conjugate in the second image.

3  View  Morphing 

The  previous  section  established  that  certain 
views  are  determined  from  two  basis  views  un¬ 
der  an  assumption  of  monotonicity.  In  this  sec¬ 
tion  we  present  a  simple  approach  for  synthesiz¬ 
ing  these  views  based  on  image  interpolation. 
The procedure takes as input two images, $I_0$
and $I_1$, their respective projection matrices, $\Pi_0$
and $\Pi_1$, and a third projection matrix $\Pi_s$ representing
the configuration of a third view along
$\overline{C_0C_1}$. The result is a new image $I_s$ representing
how the visible scene appears from the third
viewpoint.

We begin with a special case where the image
planes are parallel and aligned with $\overline{C_0C_1}$.
This configuration is often used in stereo applications
and will be referred to as the parallel
configuration. The situation is expressed algebraically
using the projection equations as follows.
A camera is represented by a 3 × 4 homogeneous
matrix $\Pi = [H \mid -HC]$. The optical
center is given by $C$ and the image plane
normal is the last row of $H$. A scene point
$(X, Y, Z)$ is expressed in homogeneous coordinates
as $P = [X\ Y\ Z\ 1]^T$ and an image point
$(x, y)$ by $p = [x\ y\ 1]^T$. Because homogeneous
structures are invariant under scalar multiplication,
$sP$ and $P$ represent the same point, and
similarly for $sp$ and $p$. We therefore reserve the
notation $P$ and $p$ for points whose last coordinate
is 1. All other multiples of these points
will be denoted as $\tilde{P}$ and $\tilde{p}$. The perspective
projection equation is:

$$\tilde{p} = \Pi P$$


In the parallel configuration, the projection matrices
may be chosen so that $\Pi_0 = [I \mid -C_0]$ and
$\Pi_1 = [I \mid -C_1]$, where $I$ is the 3 × 3 identity matrix.
Without loss of generality, we assume that
$C_0$ is at the world origin and $\overline{C_0C_1}$ is parallel
to the world $X$-axis so that $C_1 = [C_X\ 0\ 0]^T$.
Let $p_0$ and $p_1$ be projections of a scene point
$P = [X\ Y\ Z\ 1]^T$ in the two views, respectively.


Figure 3: The three steps in view morphing:
(1) Original images $I_0$ and $I_1$ are
prewarped (rectified) to be parallel,
(2) $\hat{I}_s$ is produced by interpolation,
and (3) $\hat{I}_s$ is postwarped to form $I_s$.

Linear interpolation of $p_0$ and $p_1$ yields

$$(1-s)\,p_0 + s\,p_1 = (1-s)\frac{1}{Z}\Pi_0 P + s\frac{1}{Z}\Pi_1 P = \frac{1}{Z}\Pi_s P$$

where

$$\Pi_s = (1-s)\,\Pi_0 + s\,\Pi_1 \tag{1}$$

Image interpolation, or morphing [Beier and
Neely, 1992], therefore produces a new view
whose projection matrix, $\Pi_s$, is a linear interpolation
of $\Pi_0$ and $\Pi_1$ and whose optical center
is $C_s = [sC_X\ 0\ 0]^T$. Eq. (1) indicates that
in the parallel configuration, any parallel view
along $\overline{C_0C_1}$ may be synthesized simply by interpolating
corresponding points in the two basis
views. In other words, image interpolation
induces an interpolation of viewpoint for this
special camera geometry.
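
Eq. (1) can be checked numerically for a single scene point. The following is a minimal sketch under assumed values: the scene point, the baseline $C_X = 2$, the parameter $s = 0.25$, and the helper `project` are all illustrative choices, not taken from the paper.

```python
import numpy as np

def project(Pi, P):
    """Project homogeneous scene point P with a 3x4 matrix Pi and
    normalize so that the last image coordinate is 1."""
    q = Pi @ P
    return q / q[2]

I3 = np.eye(3)
C0 = np.zeros(3)                      # first optical center at the world origin
C1 = np.array([2.0, 0.0, 0.0])        # second center on the world X-axis (C_X = 2)
Pi0 = np.hstack([I3, -C0[:, None]])   # Pi_0 = [I | -C_0]
Pi1 = np.hstack([I3, -C1[:, None]])   # Pi_1 = [I | -C_1]

P = np.array([0.7, -0.3, 5.0, 1.0])   # arbitrary scene point [X Y Z 1]^T
s = 0.25                              # interpolation parameter in [0, 1]

p_interp = (1 - s) * project(Pi0, P) + s * project(Pi1, P)  # interpolate image points
Pi_s = (1 - s) * Pi0 + s * Pi1                              # interpolate projection matrices
p_direct = project(Pi_s, P)                                 # project into the in-between view
C_s = -Pi_s[:, 3]                                           # center of Pi_s = [I | -C_s]
```

Both routes give the same image point, and the recovered center is $[sC_X\ 0\ 0]^T$, as the text states for the parallel configuration.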

To interpolate general views with projection
matrices $\Pi_0 = [H_0 \mid -H_0C_0]$ and $\Pi_1 = [H_1 \mid -H_1C_1]$,
we first apply homographies
$H_0^{-1}$ and $H_1^{-1}$ to convert $I_0$ and $I_1$ to a parallel
configuration. This procedure is identical
to rectification techniques used in stereo vision
[Robert et al., 1995]. This suggests a three-step
procedure for view synthesis:

1. Prewarp: $\hat{I}_0 = H_0^{-1} I_0$, $\hat{I}_1 = H_1^{-1} I_1$

2. Morph: linearly interpolate positions and
intensities of corresponding pixels in $\hat{I}_0$ and
$\hat{I}_1$ to form $\hat{I}_s$

3. Postwarp: $I_s = H_s \hat{I}_s$

Rectification  is  possible  providing  that  the 
epipoles  are  outside  of  the  respective  image  bor¬ 
ders.  If  this  condition  is  not  satisfied,  it  is  still 
possible  to  apply  the  procedure  if  the  prewarped 
images  are  never  explicitly  constructed,  i.e.,  if 
the  prewarp,  morph,  and  postwarp  transforms 
are  concatenated  into  a  pair  of  aggregate  warps 
[Seitz and Dyer, 1996b]. The prewarp step implicitly
requires selection of a particular epipolar
plane on which to reproject the basis images.
Although  the  particular  plane  can  be  chosen  ar¬ 
bitrarily,  certain  planes  may  be  more  suitable 
due  to  image  sampling  considerations. 
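
The prewarp, morph, and postwarp steps can be traced on a single point rather than on whole images. This is a sketch under stated assumptions, not the authors' implementation: the matrices `H0`, `H1`, `Hs` are arbitrary invertible 3 × 3 matrices generated for illustration, the world frame is chosen as in the parallel configuration, and the helpers `dehom` and `camera` are hypothetical.

```python
import numpy as np

def dehom(q):
    """Scale a homogeneous image point so that its last coordinate is 1."""
    return q / q[2]

def camera(H, C):
    """Build the 3x4 projection matrix [H | -HC]."""
    return np.hstack([H, -(H @ C)[:, None]])

rng = np.random.default_rng(0)
# Arbitrary, nearly-identity matrices stand in for the camera orientations.
H0 = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
H1 = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
Hs = np.eye(3) + 0.1 * rng.standard_normal((3, 3))  # orientation of the new view

C0 = np.zeros(3)                  # world origin
C1 = np.array([2.0, 0.0, 0.0])    # baseline along the world X-axis
s = 0.4
Cs = (1 - s) * C0 + s * C1        # in-between optical center

P = np.array([0.7, -0.3, 5.0, 1.0])     # arbitrary scene point
p0 = dehom(camera(H0, C0) @ P)          # point observed in I_0
p1 = dehom(camera(H1, C1) @ P)          # point observed in I_1

p0_hat = dehom(np.linalg.solve(H0, p0))  # 1. prewarp by H_0^{-1}
p1_hat = dehom(np.linalg.solve(H1, p1))  #    prewarp by H_1^{-1}
ps_hat = (1 - s) * p0_hat + s * p1_hat   # 2. morph (linear interpolation)
ps = dehom(Hs @ ps_hat)                  # 3. postwarp by H_s

ps_expected = dehom(camera(Hs, Cs) @ P)  # direct projection into [H_s | -H_s C_s]
```

The morphed point agrees with the direct projection into the view $[H_s \mid -H_sC_s]$; a full implementation would additionally resample image intensities at each step.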


4  Uncalibrated  View  Morphing 


In  order  to  use  the  view  morphing  algorithm 
presented  in  Section  3,  we  must  find  a  way  to 
rectify  the  images  without  knowing  the  pro¬ 
jection  matrices.  Towards  this  end,  it  can  be 
shown  [Seitz  and  Dyer,  1996a]  that  two  images 
are  in  the  parallel  configuration  when  their  fun¬ 
damental  matrix  is  given,  up  to  scalar  multipli¬ 
cation,  by 


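$$\hat{F} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix}$$

We seek a pair of homographies $H_0$ and $H_1$ such
that the prewarped images $\hat{I}_0 = H_0^{-1} I_0$ and
$\hat{I}_1 = H_1^{-1} I_1$ have the fundamental matrix $\hat{F}$.
In terms of $F$ the condition on $H_0$ and
$H_1$ is

$$H_1^T F H_0 = \hat{F} \tag{2}$$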


Solutions  to  Eq.  (2)  are  discussed  in  [Seitz  and 
Dyer,  1996a,  Robert  et  al.,  1995]. 
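
As a sanity check on the rectification condition, the sketch below builds a fundamental matrix from two assumed homographies and verifies the epipolar constraint before and after prewarping. The convention $p_1^T F p_0 = 0$, the random matrices `H0` and `H1`, and the scene point are assumptions made for illustration only.

```python
import numpy as np

# Fundamental matrix of the parallel configuration.
F_hat = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])

rng = np.random.default_rng(1)
H0 = np.eye(3) + 0.1 * rng.standard_normal((3, 3))  # assumed rectifying homographies
H1 = np.eye(3) + 0.1 * rng.standard_normal((3, 3))

# Fundamental matrix of the unrectified pair, chosen so that H1^T F H0 = F_hat.
F = np.linalg.inv(H1).T @ F_hat @ np.linalg.inv(H0)

C1 = np.array([2.0, 0.0, 0.0])     # baseline along the X-axis
P = np.array([0.7, -0.3, 5.0])     # arbitrary scene point (inhomogeneous)
p0_hat = P / P[2]                  # rectified projection, camera [I | 0]
p1_hat = (P - C1) / P[2]           # rectified projection, camera [I | -C1]
p0 = H0 @ p0_hat                   # corresponding points in the original images
p1 = H1 @ p1_hat

residual_rectified = p1_hat @ F_hat @ p0_hat  # equals y0 - y1 for points with last coordinate 1
residual_original = p1 @ F @ p0
```

With $\hat{F}$ the constraint reduces to equal $y$-coordinates, which is why corresponding points in rectified images lie on the same scanline.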


We have established that two images can be
rectified, and therefore interpolated, without
knowing their projection matrices. As in Section
3, interpolation of the prewarped images
results in new views along $\overline{C_0C_1}$. In contrast to
the calibrated case, however, the postwarp step
is underspecified; there is no obvious choice for
the homography that transforms $\hat{I}_s$ to $I_s$. One
solution is to have the user provide the homography
directly or indirectly by specification of
a small number of image points [Laveau and
Faugeras, 1994, Seitz and Dyer, 1996b]. Another
method is to simply interpolate the components
of the two rectifying homographies, resulting in a
continuous transition from $I_0$ to $I_1$ [Seitz and Dyer,
1996a]. Both methods for choosing the postwarp
transforms generally result in the synthesis
of projective views. A projective view is a
perspective view warped by a 2D affine transformation.

5  Three  Views  and  Beyond 

The paper up to this point has focused on
image synthesis from exactly two basis views.
The extension to more views is straightforward.
Suppose for instance that we have three
basis views that satisfy monotonicity pairwise
($(I_0, I_1)$, $(I_0, I_2)$, and $(I_1, I_2)$ each satisfy
monotonicity). Three basis views permit synthesis of
a triangular region of viewspace, delimited by
the three optical centers. Each pair of basis images
determines the views along one side of the
triangle, spanned by $\overline{C_0C_1}$, $\overline{C_1C_2}$, and $\overline{C_2C_0}$.

What about interior views, i.e., views with optical
centers in the interior of the triangle? Indeed,
any interior view can be synthesized by
a second interpolation, between a corner and a
side view of the triangle. However, the assumption
that monotonicity applies pairwise between
corner views is not sufficient to infer monotonicity
between interior views in the closed triangle
$\triangle C_0C_1C_2$; monotonicity is not transitive.
In order to predict interior views, a slightly
stronger constraint is needed. Strong monotonicity
dictates that for every pair of scene
points $P$ and $Q$, the line $PQ$ does not intersect
$\triangle C_0C_1C_2$. Strong monotonicity is a direct
generalization of monotonicity; in particular,
strong monotonicity of $\triangle C_0C_1C_2$ implies that
monotonicity is satisfied between every pair of
views centered in this triangle, and vice versa.
Consequently, strong monotonicity permits synthesis
of any view in $\triangle C_0C_1C_2$.
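
The viewpoint bookkeeping behind the "second interpolation" can be sketched as follows. This covers only the optical centers, not the image warps, and it assumes the interior center is given by barycentric weights; the function `two_step_parameters` and the example triangle are hypothetical.

```python
import numpy as np

def two_step_parameters(weights):
    """Split an interior viewpoint into two successive interpolations.

    weights = (w0, w1, w2) are barycentric weights of the interior optical
    center C with respect to C0, C1, C2 (all positive, summing to 1).
    Returns (s, u) such that K = (1 - s) * C1 + s * C2 lies on side C1C2
    and C = (1 - u) * C0 + u * K lies between corner C0 and K.
    """
    w0, w1, w2 = weights
    s = w2 / (w1 + w2)   # position of the side view K along C1C2
    u = w1 + w2          # position of C along C0K
    return s, u

C0 = np.array([0.0, 0.0, 0.0])
C1 = np.array([2.0, 0.0, 0.0])
C2 = np.array([0.0, 2.0, 0.0])
w = (0.5, 0.3, 0.2)
C = w[0] * C0 + w[1] * C1 + w[2] * C2   # interior optical center

s, u = two_step_parameters(w)
K = (1 - s) * C1 + s * C2               # first interpolation: side view
C_rebuilt = (1 - u) * C0 + u * K        # second interpolation: corner to side view
```

Each of the two interpolations would be carried out with the two-view procedure of Section 3, which is valid here only under strong monotonicity.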

Now suppose we have $n$ basis views with optical
centers $C_0, \ldots, C_{n-1}$ and that strong monotonicity
applies between each triplet of basis
views.¹ By the preceding argument, any triplet
of basis views determines the triangle of views
between them. In particular, any view on the
convex hull $\mathcal{H}$ of $C_0, \ldots, C_{n-1}$ is determined,
as $\mathcal{H}$ is comprised of a subset of these triangles.
Furthermore, the interior views are also determined:
let $C$ be a point in the interior of $\mathcal{H}$ and
choose a corner $C_i$ on $\mathcal{H}$. The line through $C$
and $C_i$ intersects $\mathcal{H}$ in a point $K$. Since $K$ lies
on the convex hull, it represents the optical center
of a set of views produced by two or fewer interpolations.
Because $C$ lies on $\overline{C_iK}$, all views
centered at $C$ are determined as well by one
additional interpolation, providing monotonicity
is satisfied between $C_i$ and $K$. To establish
this last condition, observe that for monotonicity
to be violated there must exist two scene
points $P$ and $Q$ such that $PQ$ intersects $\overline{C_iK}$,
implying that $PQ$ also intersects $\mathcal{H}$. Thus, $PQ$
intersects at least one triangle $\triangle C_iC_jC_k$ on $\mathcal{H}$,
violating the assumption of strong monotonicity.
In conclusion, $n$ basis views determine the
3D range of viewspace contained in the convex
hull of their optical centers.

This  constructive  argument  suggests  that  arbi¬ 
trarily  large  regions  of  viewspace  may  be  con¬ 
structed  by  adding  more  basis  views.  However, 
the prediction of any range of viewspace depends
on the assumption that all possible pairs
of  views  within  that  space  satisfy  monotonicity. 
In  particular,  a  monotonic  range  may  span  no 
more  than  a  single  aspect  of  an  aspect  graph 
[Seitz  and  Dyer,  1996a],  thus  limiting  the  range 
of  views  that  may  be  predicted.  Nevertheless, 
it  is  clear  that  a  discrete  set  of  views  implicitly 
describes  scene  appearance  from  a  continuous 
range  of  viewpoints. 


¹ In fact, strong monotonicity for each triangle on the
convex hull of $C_0, \ldots, C_{n-1}$ is sufficient.


6  Experimental  Results 

We  have  applied  the  view  morphing  algorithm 
to  many  pairs  of  basis  images,  two  of  which  are 
shown  in  Figure  4.  Each  pair  of  images  was 
uncalibrated and the fundamental matrix was
computed from several manually specified point
correspondences.

The  first  pair  of  images  shows  two  views  of  a 
face.  A  sparse  set  of  user-specified  feature  cor¬ 
respondences  was  used  to  determine  the  corre¬ 
spondence  map  [Seitz  and  Dyer,  1996b].  The 
synthesized  image  represents  a  view  halfway  be¬ 
tween  the  two  basis  views.  Some  artifacts  occur 
in  regions  where  monotonicity  is  violated,  e.g., 
near  the  right  ear. 

The second pair of images shows a wooden mannequin.
This is an object that would be difficult
to reconstruct due to lack of texture, but
for which views are relatively easy to synthesize. In this
example,  image  correspondences  were  automat¬ 
ically  determined.  Some  local  artifacts  are  vis¬ 
ible  where  monotonicity  is  violated  (e.g.,  left 
foot).  Blurring  is  caused  by  image  resampling, 
which  is  done  three  times  in  the  current  imple¬ 
mentation.  The  problem  may  be  ameliorated 
by  super-sampling  the  intermediate  images  or 
by  concatenating  the  multiple  image  transforms 
into  two  aggregate  warps  and  resampling  only 
once  [Seitz  and  Dyer,  1996b]. 

7  Conclusions 

In this paper we considered the question of
which views of a static scene may be predicted
from a set of two or more basis views, under
perspective projection. The following results
were shown. First, under monotonicity, two perspective
views determine scene appearance from the
set of all viewpoints on the line between their
optical centers. Second, under strong monotonicity,
a volume of viewspace is determined,
corresponding to the convex hull of the optical
centers of the basis views. Third, new perspective
views may be synthesized by rectifying
a pair of images and then interpolating corresponding
pixels, one scanline at a time, using a
procedure called view morphing. Fourth, view
synthesis is possible even when the views are
uncalibrated, provided the fundamental matrix
is known. In the uncalibrated case, the synthesized
images represent projective views of the
scene.

Figure 4: Basis views (left and right) of a face (top) and mannequin (bottom), with a synthesized
view (center) halfway in-between each pair.

References 

[Beier  and  Neely,  1992]  T.  Beier  and  S.  Neely. 
Feature-based  image  metamorphosis.  In 
Proc.  SIGGRAPH  92,  pages  35-42,  1992. 

[Beymer  and  Poggio,  1996]  D.  Beymer  and 
T.  Poggio.  Image  representations  for  visual 
learning.  Science,  272:1905-1909,  1996. 

[Chen  and  Williams,  1993]  S.  Chen  and 
L.  Williams.  View  interpolation  for  image 
synthesis.  In  Proc.  SIGGRAPH  93,  pages 
279-288,  1993. 

[Laveau  and  Faugeras,  1994]  S.  Laveau  and 
O.  Faugeras.  3-D  scene  representation  as  a 
collection  of  images.  In  Proc.  12th  Int.  Conf. 
Pattern  Recognition,  pages  689-691,  1994. 


[McMillan and Bishop, 1995] L. McMillan and
G. Bishop. Plenoptic modeling. In Proc. SIGGRAPH
95, pages 39-46, 1995.

[Robert et al., 1995] L. Robert, C. Zeller,
O. Faugeras, and M. Hebert. Applications
of non-metric vision to some visually guided
robotics tasks. Technical Report 2584,
INRIA, Sophia-Antipolis, France, June 1995.

[Seitz  and  Dyer,  1995]  S.  Seitz  and  C.  Dyer. 
Physically-valid  view  synthesis  by  image  in¬ 
terpolation.  In  Proc.  IEEE  Workshop  on 
Representation  of  Visual  Scenes,  pages  18- 
25,  1995. 

[Seitz  and  Dyer,  1996a]  S.  Seitz  and  C.  Dyer. 
Scene  appearance  representation  by  perspec¬ 
tive  view  synthesis.  Technical  Report  1298, 
University  of  Wisconsin,  May  1996. 

[Seitz  and  Dyer,  1996b]  S.  Seitz  and  C.  Dyer. 
View  morphing.  In  Proc.  SIGGRAPH  96, 
pages  21-30,  1996. 




Direct methods for estimation of structure and motion from three views

G. P. Stein
Artificial Intelligence Laboratory
MIT
Cambridge, MA 02139
gideon@ai.mit.edu

A. Shashua
Institute of Computer Science
Hebrew University of Jerusalem
Jerusalem 91904, Israel
http://www.cs.huji.ac.il/~shashua/


Abstract 

We  describe  a  new  ’direct  method’  for  estimating 
structure  and  motion  from  image  intensities  of  mul¬ 
tiple  views.  We  extend  the  direct  methods  of  [5]  to 
three  views.  Adding  the  third  view  enables  us  to 
solve  for  motion,  and  compute  a  dense  depth  map 
of  the  scene,  directly  from  image  spatio-temporal 
derivatives  in  a  linear  manner  without  first  having  to 
find  point  correspondences  or  compute  optical  flow. 

We  describe  the  advantages  and  limitations  of  this 
method  and  show  experiments  with  real  images. 

1  Introduction 

We  present  a  new  method  for  computing  motion  and 
dense  structure  from  three  views.  This  method  can 
be  viewed  as  an  extension  of  the  ’direct  methods’  of 
Horn  and  Weldon  [5]  from  two  views  (one  motion) 
to  three  views  (two  motions).  These  methods  are 
dubbed  ’direct  methods’  because  they  do  not  require 
prior  computation  of  optical  flow.  We  assume  small 
image  motions  on  the  order  of  a  few  pixels. 

Applying  the  constant  brightness  constraint  [4]  to 
the  trilinear  tensor  of  Shashua  and  Werman  [9,  12] 
results  in  an  equation  relating  camera  motion  and 
calibration  parameters  to  the  image  gradients  (first 
order only). We get one equation for each point in
the image with a fixed number of parameters. This
results  in  a  highly  over-constrained  set  of  equations. 

This method has advantages over both optical flow
methods [6][7] and feature based methods [12]. We
combine the information from all the points in the
image  and  thus  we  avoid  the  aperture  problem  which 
makes  computation  of  optical  flow  difficult.  We  do 
not  explicitly  define  feature  points.  Points  with 
small  gradients  simply  contribute  less  to  the  least 
squares  estimation.  Information  from  all  points  that 
have  gradients  is  used. 


G.S. would like to acknowledge DARPA contracts N00014-94-01-0994
and 95009-5381. A.S. would like to acknowledge
US-IS BSP contract 94-00120 and the European ACTS
project AC074.


These advantages are highlighted in a scene with a
set of vertical bars in front of a set of horizontal
bars, all in front of a uniform background.
Optical flow methods fail because of the straight bars
and the aperture problem. Discrete methods fail
because the intersections of the lines in the image,
which are detected as 'features', do not correspond to
real features in space. Many natural scenes, such as
tree branches, and man-made objects, such as window
frames, lamp posts and fences, give rise to these problems.

Starting with the general uncalibrated model we proceed
through a hierarchy of reduced models, first by
assuming calibrated cameras and then by assuming
the Longuet-Higgins and Prazdny small motion
model [6]. We then show how to solve the simplified
model for the motion parameters.

1.1  Previous  Work 

The  ’direct  methods’  were  pioneered  by  Horn  and 
Weldon in [5]. A single image pair results in $N$
equations in $N + 6$ unknowns, where $N$ is the number
of points in the image, so an additional constraint is needed.
Negahdaripour  and  Horn  [8]  present  a  closed  form 
solution  assuming  a  planar  or  quadratic  surface. 

This  work  is  based  on  the  work  of  Shashua  and 
Hanna  [11].  Here  we  describe  the  results  of  imple¬ 
menting  these  ideas  in  practice.  During  the  course 
of  implementation  various  subtleties  and  limitations 
were  discovered. 

2  Mathematical  Background 

2.1  Notations 

Let $x$ be a point in 3D space and let its projections in
a pair of images be $p$ and $p'$. Then $p = [I; 0]x$ and
$p' = Ax$, where the left 3 × 3 minor of $A$ stands for a
2D projective transformation of the chosen plane at
infinity and the fourth column of $A$ stands for the
epipole (the projection of the center of camera 0 on
the image plane of camera 1). In particular, in a
calibrated setting the 2D projective transformation
is the rotational component of camera motion and
the epipole is the translational component of camera
motion.

We will occasionally use tensorial notations as described
next. We use the covariant-contravariant
summation convention: a point is an object whose
coordinates are specified with superscripts, i.e.,
$p^i = (p^1, p^2, \ldots)$. These are called contravariant vectors.
An element in the dual space (representing hyperplanes,
i.e., lines in $\mathcal{P}^2$) is called a covariant vector
and is represented by subscripts, i.e.,
$s_j = (s_1, s_2, \ldots)$. Indices repeated in covariant and
contravariant forms are summed over, i.e.,
$p^i s_i = p^1 s_1 + p^2 s_2 + \cdots + p^n s_n$. This is known as a contraction.

Matching image points across three views will be denoted
by $p, p', p''$; the coordinates will be referred to
as $p^i, p'^j, p''^k$, or alternatively as non-homogeneous
image coordinates $(x, y)$, $(x', y')$, $(x'', y'')$.

2.2  Stratification  of  Motion  Models 

Three views, $p = [I; 0]x$, $p' = Ax$ and $p'' = Bx$, are
known to produce four trilinear forms whose coefficients
are arranged in a tensor representing a bilinear
function of the camera matrices $A, B$:

$$\alpha_i^{jk} = t'^j b_i^k - t''^k a_i^j \tag{1}$$

where $A = [a_i^j, t'^j]$ ($a_i^j$ is the 3 × 3 left minor and $t'$
is the fourth column of $A$) and $B = [b_i^k, t''^k]$. The
tensor acts on a triplet of matching points in the
following way:

$$p^i s_j^\mu r_k^\rho \alpha_i^{jk} = 0 \tag{2}$$

where $s_j^\mu$ are any two lines ($s_j^1$ and $s_j^2$) intersecting
at $p'$, and $r_k^\rho$ are any two lines intersecting at $p''$. Since
the free indices are $\mu, \rho$, each in the range 1, 2, we have
4 trilinear equations (unique up to linear combinations).
More details can be found in [3, 9, 12, 10].

Geometrically, a trilinear matching constraint is produced
by contracting the tensor with the point $p$ of
image 0, some line coincident with $p'$ in image 1, and
some line coincident with $p''$ in image 2. We can describe
any line through $p'$ as a linear combination
of the vertical $(1, 0, -x')$ and horizontal $(0, 1, -y')$
lines. Let the coefficients of the linear combination
be the components of the image gradient $I_x, I_y$ at
$(x, y)$ in image 0; then the line $S'$ has the form:

$$S' = \begin{pmatrix} I_x \\ I_y \\ -x' I_x - y' I_y \end{pmatrix}$$

The contribution of $x', y'$ can be removed by using
the constant brightness equation due to [4]:

$$u' I_x + v' I_y + I'_t = 0 \tag{3}$$

where $u' = x - x'$, $v' = y - y'$ and $I'_t$ is the discrete
temporal derivative at $(x, y)$, i.e., $I_1(x, y) - I_0(x, y)$.
$I_1$ and $I_0$ are the image intensity values of the second
and first images, respectively. Following the substitution
we obtain

$$S' = \begin{pmatrix} I_x \\ I_y \\ I'_t - x I_x - y I_y \end{pmatrix} \tag{4}$$


Likewise, for the third image we have a similar expression,
with $I''_t - x I_x - y I_y$ replacing the third line
in $S''$, where $I''_t$ is the temporal derivative between
images 0 and 2. The tensor brightness constraint is
therefore:

$$s''_k s'_j p^i \alpha_i^{jk} = 0. \tag{5}$$


We  get  a  constraint  equation  involving  the  unknowns 
and  the  spatio-temporal  derivatives  at  each  pixel 
—  the  constraint  is  linear  in  the  unknowns.  This 
constraint  was  introduced  by  [11].  In  other  words, 
one  can  recover  in  principle  the  camera  matrices 
across  three  views  in  the  context  of  the  aperture 
problem,  as  noticed  by  [13]. 

Starting from the general model (27 parameter
model) of the constraint equation one can introduce
a hierarchy of reduced models, as follows. By
enforcing small-angle rotation on the camera motions,
i.e., $A = [I + [w']_\times; t']$ and $B = [I + [w'']_\times; t'']$,
where $w', w''$ are the angular velocity vectors and
$[\cdot]_\times$ is the skew-symmetric matrix of vector products,
the tensor brightness constraint is reduced to
a 24-parameter model which in matrix form looks
like:

$$I''_t S'^T t' - I'_t S''^T t'' + S'^T [t' w''^T] V'' - S''^T [t'' w'^T] V' = 0, \tag{6}$$

where $V' = p \times S'$ and $V'' = p \times S''$. If, in addition,
we enforce infinitesimal translational motion
(the Longuet-Higgins & Prazdny [6] motion model),
which results in the image motion equations:

$$u' = \frac{1}{z}(t'_1 - x t'_3) - w'_3 y + w'_2 (1 + x^2) - w'_1 x y \tag{7}$$

$$v' = \frac{1}{z}(t'_2 - y t'_3) + w'_3 x - w'_1 (1 + y^2) + w'_2 x y$$

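then $S$ has the simpler form:

$$S = \begin{pmatrix} I_x \\ I_y \\ -x I_x - y I_y \end{pmatrix} \tag{8}$$

and

$$V = p \times S = \begin{pmatrix} -I_y - y(x I_x + y I_y) \\ I_x + x(x I_x + y I_y) \\ x I_y - y I_x \end{pmatrix} \tag{9}$$

We obtain a 15-parameter model of the form:

$$I''_t S^T t' - I'_t S^T t'' + S^T [t' w''^T - t'' w'^T] V = 0 \tag{10}$$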


We have one such equation for each point in the image.
It is a set of bilinear equations in the unknowns
$t', t'', w', w''$. After solving for the camera motions
(to be described later in the paper) we can solve for
the dense depth map from the equations:

$$K S^T t' + V^T w' + I'_t = 0 \tag{11}$$

$$K S^T t'' + V^T w'' + I''_t = 0 \tag{12}$$

where $K = \frac{1}{z}$ denotes the inverse of the depth at each
pixel location. Equations (11) and (12) were introduced
in [5] and can be obtained by substituting Eq. (7) in
Eq. (3) and rearranging the terms.

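
Solving the depth equation for $K$ at a single pixel can be sketched as below. This is an illustration under assumptions, not the authors' code: the pixel values and motion parameters are made up, the flow is generated from the standard Longuet-Higgins and Prazdny small-motion form (assumed here to match Eq. (7)), and the helper names `S_and_V` and `inverse_depth` are hypothetical.

```python
import numpy as np

def S_and_V(x, y, Ix, Iy):
    """Build S = (Ix, Iy, -x*Ix - y*Iy) and V = p x S for pixel p = (x, y, 1)."""
    S = np.array([Ix, Iy, -x * Ix - y * Iy])
    p = np.array([x, y, 1.0])
    return S, np.cross(p, S)

def inverse_depth(S, V, t, w, It):
    """Solve K * S^T t + V^T w + I_t = 0 for K (undefined where S^T t = 0)."""
    return -(V @ w + It) / (S @ t)

# Synthetic pixel: coordinates, spatial gradient, true inverse depth, motion.
x, y, Ix, Iy = 0.2, -0.1, 0.8, -0.5
K_true = 0.25
t = np.array([0.3, -0.1, 0.05])      # translation (illustrative values)
w = np.array([0.01, -0.02, 0.005])   # small rotation (illustrative values)

# Image motion from the small-motion model, then I_t from constant brightness.
u = K_true * (t[0] - x * t[2]) - w[2] * y + w[1] * (1 + x**2) - w[0] * x * y
v = K_true * (t[1] - y * t[2]) + w[2] * x - w[0] * (1 + y**2) + w[1] * x * y
It = -(u * Ix + v * Iy)

S, V = S_and_V(x, y, Ix, Iy)
K = inverse_depth(S, V, t, w, It)
```

On this noise-free example the recovered `K` equals `K_true`; with real images the estimate is unreliable wherever $S^T t$ is close to zero.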
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3  Solving  the  bilinear  equation 

3.1  The  pure  translation  case 

In the pure translation case equation (10) becomes:

$$I''_t S^T t' - I'_t S^T t'' = 0 \tag{13}$$

We have one such equation for each image point.
Written in matrix form this becomes $At = 0$, where
$t = (t', t'')^T$ and $A$ is an $N \times 6$ matrix with the $i$th
row (corresponding to the $i$th pixel) given by:

$$\begin{pmatrix} I''_t S_{i1} & I''_t S_{i2} & I''_t S_{i3} & -I'_t S_{i1} & -I'_t S_{i2} & -I'_t S_{i3} \end{pmatrix} \tag{14}$$

We avoid the trivial solution $t = 0$ by adding the
constraint $\|t\| = 1$. The least squares problem now
maps to the problem of finding the $t$ with $\|t\| = 1$ that minimizes:

$$t^T A^T A t \tag{15}$$

The solution is the eigenvector of $A^T A$ corresponding
to the smallest eigenvalue.
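
The eigenvector solution can be sketched on synthetic, noise-free data. Everything below is an assumption made for illustration: the random gradients and inverse depths, the two translation vectors, the function name `solve_pure_translation`, and the row layout, which follows Eq. (13) with the unknowns stacked as $(t', t'')$.

```python
import numpy as np

def solve_pure_translation(S, It1, It2):
    """Estimate t = (t', t'') up to sign and scale.

    S   : (N, 3) array, one vector S per pixel
    It1 : (N,) temporal derivatives between images 0 and 1 (I'_t)
    It2 : (N,) temporal derivatives between images 0 and 2 (I''_t)
    """
    # Each row encodes I''_t S^T t' - I'_t S^T t'' = 0.
    A = np.hstack([It2[:, None] * S, -It1[:, None] * S])
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)  # eigenvalues in ascending order
    return eigvecs[:, 0]                        # unit eigenvector, smallest eigenvalue

rng = np.random.default_rng(0)
N = 200
xy = rng.uniform(-0.5, 0.5, size=(N, 2))        # pixel coordinates
grad = rng.standard_normal((N, 2))              # spatial gradients (Ix, Iy)
S = np.column_stack([grad[:, 0], grad[:, 1], -(xy * grad).sum(axis=1)])
K = rng.uniform(0.1, 1.0, size=N)               # inverse depths

t1_true = np.array([0.3, -0.1, 0.05])
t2_true = np.array([-0.2, 0.25, 0.1])
It1 = -K * (S @ t1_true)                        # depth equation with zero rotation
It2 = -K * (S @ t2_true)

t_est = solve_pure_translation(S, It1, It2)
t_true = np.concatenate([t1_true, t2_true])
t_true /= np.linalg.norm(t_true)
alignment = abs(t_est @ t_true)                 # 1.0 means equal up to sign
```

The estimate is only defined up to sign and a common scale, which is why the comparison uses the absolute value of the dot product with the normalized ground truth.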

3.2  Translation  with  rotation 

In the general Longuet-Higgins and Prazdny model
we are confronted with the bilinear equation (10).
We treat the 6 translation parameters and the 9
outer product terms $[t' w''^T - t'' w'^T]$ as 15 intermediate
parameters which are solved for as in section
(3.1) but with an $N \times 15$ matrix $A$ (so that $A^T A$ is 15 × 15).
After recovering the 15 intermediate parameters we compute $w'$ and
$w''$ from $[t' w''^T - t'' w'^T]$.

There is one problem. If we consider the $N \times 15$ matrix $A$ we note that the vector

$$c_0 = (0,0,0,0,0,0,1,0,0,0,1,0,0,0,1)^T \qquad (16)$$

is in the null space of $A$. We note that $c_0$ is an inadmissible solution: its last nine elements form the identity matrix, which has rank 3, whereas $t' w''^T - t'' w'^T$ has rank at most 2. The correct solution vector is also in the null space of $A$. Therefore the null space of $A$ has dimension $\geq 2$ and $A$ is of rank $\leq 13$.

Suppose we found a vector $b_0$ in the null space of $A$ that is linearly independent of $c_0$. The two vectors $c_0$ and $b_0$ span the null space of $A$ and the desired solution vector $b$ is a linear combination of the two:

$$b = b_0 + a\, c_0 \qquad (17)$$

In order to find $a$ given $c_0$ and $b_0$ we must apply some constraint. We choose to enforce the constraint that the matrix $B = [t' w''^T - t'' w'^T]$ be of rank 2. Let $B_0$ and $C_0$ be the last 9 elements of the vectors $b_0$ and $c_0$ organized in 3 × 3 matrices. Since $C_0 = I$, $B = B_0 + aI$ will be of rank 2 if $-a$ is an eigenvalue of $B_0$. This gives up to 3 solutions and we chose the eigenvalue with the smallest absolute value.
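
A short worked step, added here as a sketch, shows why the rank constraint reduces to an eigenvalue problem. It uses only (17) and $C_0 = I$.

```latex
B = B_0 + a\,C_0 = B_0 + a I, \qquad
\operatorname{rank}(B) \le 2 \iff \det(B_0 + a I) = 0 .
% det(B_0 + a I) is a cubic polynomial in a, so there are at most three real roots.
% Under the sign convention of (17), a root a corresponds to -a being an eigenvalue of B_0.
```

Because the candidates differ from the eigenvalues of $B_0$ only in sign, selecting the one of smallest absolute value picks the same root under either sign convention.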

4  Implementation  details 

4.1  Iterative  refinement  and  coarse  to 
fine  processing 

The constant brightness constraint is a linearized form of the Sum of Squared Differences (SSD) criterion. The linear solution is therefore a single iteration of Newton's method. For iterative refinement, one first calculates motion and depth using the above equations. Then, using the depth and motion, images 1 and 2 are warped towards image 0. A correction to the motion and depth estimates is computed using the warped images. In the ideal case, as the final result, the warped images should appear nearly identical to image 0. For details see [14].

In order to deal with image motions larger than 1 pixel we use a Gaussian pyramid for coarse to fine processing [1][2].
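
The coarse to fine idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 5-tap binomial kernel, the replicated border, and the function names `gaussian_pyramid` and `coarsest_motion` are all assumptions made for the example.

```python
# Binomial 5-tap kernel, a common approximation to a Gaussian (assumed here).
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]


def blur_1d(row):
    """Convolve one row with KERNEL, replicating the border samples."""
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)  # clamp index to the valid range
            acc += w * row[j]
        out.append(acc)
    return out


def reduce_level(img):
    """Blur rows, then columns, then keep every second sample in each direction."""
    rows = [blur_1d(r) for r in img]
    cols = [blur_1d(list(c)) for c in zip(*rows)]
    blurred = [list(r) for r in zip(*cols)]
    return [r[::2] for r in blurred[::2]]


def gaussian_pyramid(img, levels):
    """Return [full resolution, half, quarter, ...] with `levels` entries."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid


def coarsest_motion(motion_pixels, levels):
    """Image motion, in pixels, as seen at the coarsest pyramid level."""
    return motion_pixels / 2 ** (levels - 1)
```

With the 4 levels and the image motions of up to 11 pixels reported in Section 5, `coarsest_motion(11, 4)` gives 1.375 pixels, which is close to the 1 pixel regime in which the linearized constraint is expected to hold.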

4.2  Computing  the  depth,  smoothing 
and  interpolation. 

After recovering camera motions, using equations (11) and (12) we compute depth at every point where $S^T t' \neq 0$ or $S^T t'' \neq 0$. We combine information from both images and interpolate over areas where image gradients are small using Local Weighted Regression. This could be replaced by other methods such as RBFs, B-splines or thin-plate interpolation.

Equation (18) shows the cost function used to compute the depth at a given point:

$$\arg\min_{K} \sum_{x,y \in R} \beta(x,y) \sum_{j} |S^T t^j|^p \left[\,\cdots\,\right] \qquad (18)$$

We sum over a region $R$ using a windowing function $\beta(x,y)$ and over the two motions $j = 1, 2$. The $|S^T t^j|^p$ term reduces the weight of points with a small gradient or where the gradient is perpendicular to the motion. We used $p = 1$. During the iteration process we used a region ($R$) of 5 × 5 or 7 × 7.

5  Experiments  with  real  images 

5.1  Experimental  procedure 

Images  were  taken  with  a  Phillips  |mc/i  CCD  video 
camera and an 8mm lens using the SGI Indy built-in frame grabber at 640 × 480 pixel resolution. Figure (1a) shows one of the three images. Depth ranged from 450mm to 750mm. Camera motions were 3mm vertical and (about) 5mm horizontal. No special care was taken to ensure precise motion. Image motions were 6 to 11 pixels.

5.2  Results 

We  used  4  levels  of  coarse  to  fine  processing  with  2 
iterations  at  each  level.  Varying  the  number  of  iter¬ 
ations  from  1  through  4  had  no  qualitative  impact 
but  using  a  single  iteration  caused  a  small  change 
in  the  resulting  motion  estimates.  A  7  x  7  region 
was  used  for  the  local  constant  depth  fit  at  all  lev¬ 
els.  Table  (1)  shows  the  estimated  camera  motions. 
It  shows  that  the  first  motion  was  along  the  Y  axis 
and  the  second  motion  along  the  X  axis  and  that  for 
both  cases  the  rotation  was  negligible.  This  is  qual¬ 
itatively  correct.  We  do  not  have  accurate  ground 
truth  estimates. 

Figure (1b) shows the recovered depth map (in fact this shows $K(x,y) = 1/z$). Figure (1c) shows a 3D rendering of the surface $K(x,y)$. The $K$ values have been scaled by 100.0. In order to get smoother and more visually pleasing results a local planar model ($K(x,y) = K_x x + K_y y + K_0$) was used for the final stage with a 30 × 30 region of support. The results are shown in figures (2a, 2b). There is noticeable smoothing and overshoots at depth discontinuities and the tip of the nose. In (2b) the texture was removed for clarity.

Figure 1: One of the three input images (a), the estimated $K(x,y) = 1/z$ inverse depth map (b), and a 3D rendering of the surface with $K$ scaled by 100.0 (c). Uses a 7 × 7 region and a local constant depth model.


6  Discussion  and  future  work 

We  have  presented  a  new  method  for  recovering 
structure  and  motion  from  3  views  which  does  not 
require  feature  correspondence  or  optical  flow.  We 
have  shown,  using  real  images,  that  the  method  can 
qualitatively  recover  depth  and  motion  in  the  gen¬ 
eral,  small  motion  case.  These  results  are  promising 
but  more  experiments  are  needed  to  test  the  accu¬ 
racy  of  the  motion  estimation. 


Figure 2: 3D rendering of the surface $K(x,y)$, panels (a) and (b). Uses a 30 × 30 region and a locally planar depth model. Note the overshoot at depth discontinuities around the head.


Table 1: Motion estimates from real images.

Motion    FOE                 Rotation W
1         (-603.7, -8055)     (0.00022, -0.00055, -0.0217)
2         (16302, 300.2)      (-0.00027, -0.00017, -0.00062)
Abstract

This  paper  describes  a  method  for  generat¬ 
ing  dense  depth  maps  given  large  numbers  of 
images  taken  from  arbitrary  positions.  The 
algorithm  presented  is  completely  local  and 
uses  an  epipolar  image  to  generate  for  each 
pixel  an  evidence  versus  depth  and  surface 
normal distribution. In many cases, the distribution contains a clear and distinct global maximum. The location of this peak determines the depth and its shape can be used
to  estimate  the  error.  The  distribution  can 
also  be  used  to  perform  a  maximum  likeli¬ 
hood  fit  of  models  directly  to  the  images.  We 
anticipate  that  the  ability  to  perform  maxi¬ 
mum  likelihood  estimation  from  purely  local 
calculations  will  prove  useful  in  constructing 
three  dimensional  models  from  large  sets  of 
images. 

1  Introduction 

One approach to improving the results obtained by stereo techniques is to use multiple images. Several researchers, such as Yachida [1986], have proposed trinocular stereo algorithms. Others have also used special camera configurations to aid in the correspondence problem [Tsai, 1983, Bolles et al., 1987, Okutomi and Kanade, 1993]. The work presented here also uses multiple images and draws its major inspiration from Bolles, Baker and Marimont [1987]. We define a construct called an epipolar image and use it to analyze evidence about depth. Like Tsai [1983] and Okutomi and Kanade [1993] we define a cost function that is applied across multiple images, and like Cox [1996] we model the occlusion process. There are several important differences, however. The epipolar image we define is valid for arbitrary camera positions and models some forms of occlusion. Our method is intended to recover dense depth maps of built geometry (architectural facades) using thousands of images acquired from within the scene. In most cases, depth can be recovered using purely local information, avoiding the computational costs of global constraints. Where depth cannot be recovered using purely local information, the depth evidence from the epipolar image provides a principled distribution for use in a maximum-likelihood approach [Duda and Hart, 1973].

*This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for the laboratory's artificial intelligence research is provided in part by the Advanced Research Projects Agency of the Department of Defense under Office of Naval Research contract N00014-91-J-4038. The author was also supported by the Advanced Research Projects Agency of the Department of Defense under Rome Laboratory contract F3060-94-C-0204.

2  Our  Approach 

We assume that camera pose is known in an absolute coordinate system. Although relative positions are sufficient for the discussion in this section, global positions allow us to perform reconstruction incrementally using disjoint scenes. We also assume known internal camera parameters. For a more complete description of our method see [Mellor et al., 1996].

Figure 1: Epipolar image geometry.

Figure 2: Constructing an epipolar image.

2.1  Epipolar  Images 

For our analysis we will define an epipolar image $\mathcal{E}$ which is a function of one image and a point in that image. An epipolar image is similar to an epipolar-plane image [Bolles et al., 1987], but has one critical difference that ensures it can be constructed for every pixel in an arbitrary set of images. Rather than use projections of a single epipolar plane, we construct the epipolar image from the pencil of epipolar planes defined by the line $\ell^*$ through one of the camera centers $C^*$ and one of the pixels $p^*$ in that image$^1$ $\Pi_I^*$ (Figure 1). $\Pi_E^i$ is the epipolar plane formed by $\ell^*$ and the $i$-th camera center $C_i$. Epipolar line $\ell_E^i$ contains all of the information about $\ell^*$ present in $\Pi_I^i$.

$^1$ $\Pi$ is used to denote a plane; the subscript identifies the type (epipolar or image); and the superscript identifies the instance.

To simplify the analysis of an epipolar image we can group points from the epipolar lines according to possible correspondences (Figure 2). $P_1$ projects to $p_1^i$ in $\Pi_I^i$; therefore $\{p_1^i\}$ has all of the information contained in $\{\Pi_I^i\}$ about $P_1$. Similarly, there is a distinct set for $P_2$; thus $\{p_j^i \mid \text{for a given } j\}$ contains all of the possible correspondences for $P_j$. If $P_j$ is a point on the surface of a physical object and it is visible in $\{\Pi_I^i\}$ and $\Pi_I^*$, then measurements taken at $p_j^i$ should match$^2$ those taken at $p^*$ (Figure 3a). Conversely, if $P_j$ is not a point on the surface of a physical object then the measurements taken at $p_j^i$ are unlikely to match those taken at $p^*$ (Figures 3b and 3c). Epipolar images can be viewed as tables which accumulate evidence about possible correspondences of $p^*$. A simple function of $j$ is used to build $\{P_j \mid \forall i < j : \|P_i - C^*\| < \|P_j - C^*\|\}$. In essence, $\{P_j\}$ is a set of samples along $\ell^*$ at increasing depths from the image plane of $p^*$.

$^2$ So far we have considered only diffuse surfaces. The matching function can be extended to account for specularity and we intend to do so in the future.
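
The sample set $\{P_j\}$ can be sketched as points at increasing distance from $C^*$ along the ray through $p^*$. The text only says that "a simple function of $j$" is used, so the uniform spacing below is an assumption, as are the function name `sample_ray` and its parameters.

```python
import math


def sample_ray(center, direction, d_min, d_max, n):
    """Return n points P_j on the ray from `center` along `direction`.

    Points are spaced uniformly between distances d_min and d_max
    (an assumed sampling rule), so distance from the center grows with j.
    Requires n >= 2.
    """
    length = math.sqrt(sum(c * c for c in direction))
    unit = [c / length for c in direction]
    step = (d_max - d_min) / (n - 1)
    points = []
    for j in range(n):
        d = d_min + j * step
        points.append(tuple(center[k] + d * unit[k] for k in range(3)))
    return points
```

Each $P_j$ would then be projected into every other image to fill column $j$ of the epipolar image; that projection step depends on the camera model and is not shown.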
2.2  Analyzing  Epipolar  Images

An epipolar image $\mathcal{E}$ is constructed by organizing image measurements into a two-dimensional array with $i$ and $j$ as the vertical and horizontal axes respectively. Rows in $\mathcal{E}$ are epipolar lines from different images; columns form sets of possible correspondences ordered by depth$^3$ (Figure 2). The quality $\nu(j)$ of the match between column $j$ and $p^*$ can be thought of as evidence that $p^*$ is the projection of $P_j$ with depth $j$. Real cameras have a finite field of view, and $p_j^i$ may not be contained in the image $\Pi_I^i$. Thus, only terms for which $p_j^i \in \{\Pi_I^i\}$ should be included, producing

$$\nu(j) = \frac{\sum_{i \mid p_j^i \in \{\Pi_I^i\}} \mathcal{X}(\mathcal{F}(p_j^i), \mathcal{F}(p^*))}{\sum_{i \mid p_j^i \in \{\Pi_I^i\}} 1} \qquad (1)$$

where $\mathcal{F}()$ is an image measurement and $\mathcal{X}()$ is a cost function which measures the difference between $\mathcal{F}(p_j^i)$ and $\mathcal{F}(p^*)$. Ideally, $\nu(j)$ will have a sharp, distinct peak at the correct depth, so that

$$\arg\max_{j} \nu(j) = \text{the correct depth of } p^*.$$

As the number of elements in $\{p_j^i \mid \text{for a given } j\}$ increases, the likelihood increases that $\nu(j)$ will be large when $P_j$ lies on a physical surface and small when it does not. Occlusion does not produce a peak at an incorrect depth or a false positive$^4$. It can, however, cause false negatives or the absence of a peak at the correct depth (Figure 4). A false negative is essentially a lack of evidence about the correct depth and can be addressed in two ways: removing the contribution of occluded views, and adding unoccluded views by acquiring more images.

A large class of occluded views can be eliminated quite simply. Each point $P_j$ has an associated normal $n_j$. Images with camera centers in the negative half space defined by $n_j$ cannot possibly have imaged $P_j$. Of course, $n_j$ is not known a priori, but the fact that $P_j$ is visible in $\Pi_I^*$ limits its possible values. This range of values can then be sampled and used to eliminate the contribution of occluded views from $\nu(j)$. Let $\alpha$ be an estimate of $n_j$ and $\widehat{C_i P_j}$ be the unit vector along the ray from $C_i$ to $P_j$, then $P_j$ can only be visible if $\widehat{C_i P_j} \cdot \alpha < 0$. The updated function becomes:

$$\nu(j, \alpha) = \frac{\sum_{i \in S} (\widehat{C_i P_j} \cdot \alpha)\, \mathcal{X}(\mathcal{F}(p_j^i), \mathcal{F}(p^*))}{\sum_{i \in S} \widehat{C_i P_j} \cdot \alpha} \qquad (2)$$

where

$$S = \left\{ i \;\middle|\; p_j^i \in \{\Pi_I^i\},\; \widehat{C_i P_j} \cdot \alpha < 0 \right\}.$$

Then, if sufficient evidence exists,

$$\arg\max_{j, \alpha} \nu(j, \alpha) \Rightarrow \begin{cases} j = \text{depth of } p^* \\ \alpha = \text{an estimate of } n_j \end{cases}$$

$^3$ The depth of $P_j$ can be trivially calculated from $j$, therefore we consider $j$ and depth to be interchangeable.

$^4$ Except possibly in an adversarial setting.
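
The two evidence functions can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: measurements are reduced to scalars, `match_cost` is a hypothetical stand-in for the cost function (zero for a perfect match, more negative otherwise), and `None` marks a projection that falls outside an image.

```python
def match_cost(f_a, f_b):
    """Hypothetical stand-in for the cost function: 0 is a perfect match."""
    return -abs(f_a - f_b)


def nu(column, f_star):
    """Equation (1): average match quality over the views that see p_j^i.

    `column` holds one measurement per image for a fixed depth index j,
    with None where the projection lies outside that image.
    """
    visible = [f for f in column if f is not None]
    if not visible:
        return None  # no evidence at this depth
    return sum(match_cost(f, f_star) for f in visible) / len(visible)


def nu_oriented(column, rays, alpha, f_star):
    """Equation (2): weight each view by (unit ray . alpha), keeping only
    views with a negative dot product, i.e. on the visible side of alpha."""
    num = 0.0
    den = 0.0
    for f, ray in zip(column, rays):
        if f is None:
            continue
        w = sum(r * a for r, a in zip(ray, alpha))
        if w < 0:
            num += w * match_cost(f, f_star)
            den += w
    return num / den if den != 0 else None


def best_depth(epipolar_image, f_star):
    """Return the depth index j that maximizes nu(j), plus all scores."""
    n_depths = len(epipolar_image[0])
    scores = [nu([row[j] for row in epipolar_image], f_star)
              for j in range(n_depths)]
    valid = [j for j, s in enumerate(scores) if s is not None]
    return max(valid, key=lambda j: scores[j]), scores
```

A caller would build one row of `epipolar_image` per image and one column per depth sample, then read the depth of $p^*$ from the index returned by `best_depth`.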


3  Results 

Synthetic imagery was used to explore the characteristics of $\nu(j)$ and $\nu(j, \alpha)$. A CAD model of Technology Square, the four-building complex housing our laboratory, was built by hand. The locations and geometries of the buildings were determined using traditional survey techniques. Photographs of the buildings were used to extract texture maps which were matched with the survey data. This three-dimensional model was then rendered from 100 positions along a “walk around the block”. From this set of images, a $\Pi_I^*$ and $p^*$ were chosen and an epipolar image $\mathcal{E}$ constructed. $\mathcal{E}$ was then analyzed using equations 1 and 2 where$^5$

$$\mathcal{F}(x) = \mathrm{hsv}(x) = [h(x), s(x), v(x)]^T \qquad (3)$$

and

$$\mathcal{X}([h_1, s_1, v_1]^T, [h_2, s_2, v_2]^T) = \cdots\; (2 - s_1 - s_2)\, |v_1 - v_2| \qquad (4)$$

Figure 5 shows a base image $\Pi_I^*$ with $p^*$ marked by a cross. Under $\Pi_I^*$ is the epipolar image $\mathcal{E}$ generated using the remaining 99 images. Below $\mathcal{E}$ is the matching function $\nu(j)$ (1) and $\nu(j, \alpha)$ (2). The horizontal scale, $j$ or depth, is the same for $\mathcal{E}$, $\nu(j)$ and $\nu(j, \alpha)$. The vertical axis of $\mathcal{E}$ is the image index, and of $\nu(j, \alpha)$ is a coarse estimate of the orientation $\alpha$ at $P_j$. The vertical axis of $\nu(j)$ has no significance; it is a single row that has been replicated for clarity. To the right, $\nu(j)$ and $\nu(j, \alpha)$ are also shown as two-dimensional plots$^6$.


Figure 5a shows the epipolar image that results when the upper left-hand corner of the foreground building is chosen as $p^*$. Near the bottom of $\mathcal{E}$, $\ell_E^i$ is close to horizontal, and $p_j^i$ is the projection of blue sky everywhere except at the building corner. The corner points show up in $\mathcal{E}$ near the right side as a vertical streak. This is as expected since the construction of $\mathcal{E}$ places the projections of $P_j$ in the same column. Near the middle of $\mathcal{E}$, the long horizontal streaks result because $P_j$ is occluded, and near the top the large black region is produced because $p_j^i \notin \Pi_I^i$. Both $\nu(j)$ and $\nu(j, \alpha)$ have a sharp peak$^7$ that corresponds to the vertical stack of corner points. This peak occurs at a depth of 2375 units ($j = 321$) for $\nu(j)$ and a depth of 2385 ($j = 322$) for $\nu(j, \alpha)$. The actual distance to the corner is 2387.4 units. The reconstructed world coordinates of $p^*$ are $[-1441, -3084, 1830]^T$ and $[-1438, -3077, 1837]^T$ respectively. The actual coordinates$^8$ are $[-1446, -3078, 1846]^T$.

$^5$ The well known hue, saturation, value color model is denoted by hsv.

$^6$ Actually $\nu(j, \alpha) / \sum_{\alpha} \cdots$ is plotted for

In Figure 5b, $p^*$ is a point from the interior of a building face with highly periodic texture. There is a clear peak in $\nu(j, \alpha)$ that agrees well with manual measurements and is better than that in $\nu(j)$. In Figure 5c, $p^*$ is a point on a building face that is occluded (Figure 4) in a number of views. Both $\nu(j)$ and $\nu(j, \alpha)$ produce fairly good peaks that agree with manual measurements.


To further test our method, we reconstructed the depth of a region in one of the images (Figure 6). For each pixel inside the black rectangle the global maximum of $\nu(j, \alpha)$ was taken as the depth of that pixel. Figure 7 shows the reconstructed world coordinates$^9$ for each of the 3000 pixels in the region. The clusters of points beyond the left end (near [0,2]) and at the right end of the building correspond to sky points. The actual world coordinates were calculated from the CAD model and are shown in grey. The camera position is marked by a grey line extending from the center of projection in the direction of the optical axis. The reconstruction was performed purely locally at each pixel. Global constraints such as ordering or smoothness were not imposed, and no attempt was made to remove depths with low confidence or otherwise post-process the global maximum of $\nu(j, \alpha)$.

$^7$ White indicates minimum error, black maximum.

$^8$ Some of the difference may be due to the fact that $p^*$ was chosen by hand and might not be the exact projection of the corner.

$^9$ All coordinates have been divided by 1000 to simplify the plots.

Figure 6: Reconstructed region.

Figure 7: Reconstructed and actual points (panels a and b).

Figure 8: Number of points versus error.


Next, we considered outliers. Figure 8 shows the cumulative distribution of reconstruction error. Error is expressed as a percentage: the distance between the reconstructed $P_r$ and actual $P_a$ positions divided by the depth of the reconstructed point ($\|P_r - P_a\| / \|P_r - C^*\|$). The plotted curve indicates that the percentage$^{10}$ of reconstructed points with an error of less than 1% is 90% for the noise free case and 80% for noise levels of five times expected. Outliers also tend to have less support (fewer cameras contributing to the solution) and poor match quality (smaller values for $\nu(j, \alpha)$). Figure 9 shows the result of considering only points which have at least $n$ cameras contributing to the solution. Similar results are obtained when points with small $\nu(j, \alpha)$ are removed.
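
The error measure and the support filter described above can be sketched as follows. The function names and the dictionary layout of a reconstructed point are assumptions made for this example.

```python
import math


def relative_error_percent(p_rec, p_act, camera_center):
    """||P_r - P_a|| / ||P_r - C*||, expressed as a percentage."""
    return 100.0 * math.dist(p_rec, p_act) / math.dist(p_rec, camera_center)


def filter_by_support(points, min_cameras):
    """Keep points with at least `min_cameras` contributing views.

    Each point is assumed to be a dict with a "cameras" count.
    """
    return [p for p in points if p["cameras"] >= min_cameras]
```

As a rough check against the corner example in this section: the reconstructed and actual coordinates differ by (8, 1, -9), a distance of about 12.1 units. If the reported depth of 2385 units is taken as $\|P_r - C^*\|$, the error is about 0.5%, inside the 1% band discussed above.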

Finally, we reconstructed another region about twice the size of the previous one which contained only building points. This time, we retained only points with more than 6 cameras contributing or with $\nu(j, \alpha) > -0.5$. Figure 10 shows the reconstructed points$^{11}$ rendered as oriented rectangular surface elements or surfels [Szeliski and Tonnesen, 1992]. We anticipate that the estimated orientation will prove very useful in fitting models to the reconstructed points or grouping them into surfaces.

$^{10}$ Sky points are omitted.

$^{11}$ Actually the data is downsampled by three in each direction for clarity.
Figure 9: Outliers versus number of contributing cameras (panels: n ≥ 2, n ≥ 4, n ≥ 7).

Figure 10: Two views of the reconstructed surfels.

4  Conclusions

This  paper  describes  a  method  for  generating 
dense  depth  maps  directly  from  large  sets  of 
images  taken  from  arbitrary  poses.  The  algo¬ 
rithm  presented  is  simple,  accurate,  and  uses 
only  local  calculations.  Our  method  builds, 
then  analyzes,  an  epipolar  image  to  accumu¬ 
late  evidence  about  the  depth  at  each  im¬ 
age  pixel.  This  analysis  produces  an  evi¬ 
dence  versus  depth  and  surface  normal  dis¬ 
tribution  that  in  many  cases  contains  a  clear 
and  distinct  global  maximum.  The  location 
of  this  peak  determines  depth  and  orienta¬ 
tion,  and  its  shape  can  be  used  to  estimate 
the  error.  The  distribution  can  also  be  used 
to  perform  a  maximum  likelihood  fit  of  mod¬ 
els  directly  to  the  images.  We  anticipate  that 
the  ability  to  perform  maximum  likelihood 
estimation  from  purely  local  calculations  will 
prove  extremely  useful  in  constructing  three- 
dimensional  models  from  large  sets  of  im¬ 
ages. 
Shape-based  registration  is  a  process  for  estimating
the  transformation  between  two  shape  representa¬ 
tions  of  an  object.  It  is  used  in  many  image-guided 
surgical  systems  to  establish  a  transformation 
between  pre-  and  intra-operative  coordinate  sys¬ 
tems.  This  paper  describes  several  tools  which  are 
useful  for  improving  the  accuracy  resulting  from 
shape-based  registration:  constraint  analysis,  con¬ 
straint  synthesis,  and  online  accuracy  estimation. 
Constraint  analysis  provides  a  scalar  measure  of 
sensitivity  which  is  well  correlated  with  registra¬ 
tion  accuracy.  This  measure  can  be  used  as  a  crite¬ 
rion  function  by  constraint  synthesis,  an 
optimization  process  which  generates  configura¬ 
tions  of  registration  data  which  maximize  expected 
accuracy.  Online  accuracy  estimation  uses  a  con¬ 
ventional  root-mean-squared  error  measure  cou¬ 
pled  with  constraint  analysis  to  estimate  an  upper 
bound  on  true  registration  error.  This  paper  demon¬ 
strates  that  registration  accuracy  can  be  signifi¬ 
cantly  improved  via  application  of  these  methods. 

1.  Introduction 

The  registration  process  is  a  fundamental  compo¬ 
nent  of  most  image-guided  surgical  systems.  Reg¬ 
istration  estimates  a  spatial  transformation  between 
two  coordinate  systems:  a  pre-operative  system 
used  to  construct  plans  or  simulations  based  upon 
medical  data  (e.g.,  CT,  MRI,  or  X-ray  images),  and 
an  intra-operative  system  in  which  the  surgical  pro¬ 
cedure  is  performed  (e.g.,  relative  to  a  robot,  navi¬ 
gational  guidance  system,  etc.)  Any  image-guided 
surgical  procedure  which  spatially  relates  pre-oper¬ 
ative  data  to  intra-operative  execution  requires 
solution  of  the  registration  problem. 

There are many approaches to registration for image-guided surgery and an excellent review can be found in [5]. A class of registration methods referred to as shape-based methods uses representations of object shape to estimate the required transformation. Representations are constructed using data collected in the two coordinate systems (i.e., pre- and intra-operative). Registration estimates a transformation which aligns one shape representation with the other in a manner which minimizes a measure of the distance between them.

Several  factors  affect  shape-based  registration 
accuracy,  including:  errors  in  the  shape  representa¬ 
tions  due  to  sensor  noise  or  shape  reconstruction 
errors  [10];  the  quantity  of  registration  data;  and 
the  locations  on  the  registration  object  from  which 
the  data  are  collected  [11].  This  paper  addresses 
the  problem  of  improving  shape-based  registration 
accuracy  via  intelligent  selection  of  registration 
data  and  online  estimation  of  accuracy.  Intelligent 
data  selection  (IDS)  is  comprised  of  geometric 
constraint  analysis  which  provides  a  sensitivity 
measure  shown  to  be  well  correlated  with  registra¬ 
tion  accuracy;  and  geometric  constraint  synthesis, 
an  optimization  process  which  generates  data  con¬ 
figurations  which  maximize  the  sensitivity  measure 
for  a  fixed  quantity  of  data.  IDS  uses  the  pre-opera¬ 
tive  shape  representation  to  generate  a  data  collec¬ 
tion  plan  (DCP)  which  can  be  used  during  surgery 
to  guide  the  acquisition  of  registration  data.  Online 
accuracy  estimation  provides  an  upper  bound  on 
true  registration  accuracy  based  upon  a  conven¬ 
tional  root-mean-squared  error. 

The  proposed  methods  have  been  investigated  in- 
vitro  on  cadaveric  specimens  and  via  simulation 
studies  and  are  currently  being  incorporated  into  a 
clinical  image-guided  orthopaedic  surgical  applica¬ 
tion  [9].  The  current  paper  describes  the  methods, 
reports  encouraging  results,  and  suggests 
approaches  for  incorporating  the  methods  into  clin¬ 
ically  viable  registration  systems. 
2.  Methods

This paper focuses on a special case of shape-based registration: surface-based registration with discrete point data. One shape representation (the “Model”) is a triangle mesh surface model of the registration object constructed from CT images. The other representation (the “Data”) is a set of discrete point data collected from the registration object during surgery using a digitizing probe.

2.1.  Constraint  Analysis

Most approaches to shape-based registration attempt to minimize an error measure such as the following least-squared measure:

$$\min_{T} \sum_{i} \|\,\cdots\,\|^2 \qquad (1)$$

where each $D_i$ represents a point in the Data, each $M_i$ represents a point in the Model, and $T$ is a 3-D transformation which minimizes the expression. Details of shape-based registration methods can be found in [2][3][5], and descriptions of the methods used in this work appear in [8]. Due to fundamental similarities among shape-based registration solution methods, the techniques proposed in this paper are independent of the particular registration solution method used.

Solving the registration problem results in an estimate, $T_{est}$, of the true (and usually unknown) registration transformation, $T_{true}$. The error resulting from a single registration trial can be expressed as

$$T_{err} = T_{true}^{-1}\, T_{est} \qquad (2)$$

where $T_{err}$ is a transformation which represents the difference between estimated and true transformations. The goal of constraint analysis is to provide a scalar measure of sensitivity which is a good predictor of $T_{err}$ for a given Model and Data without performing registration, and without the need to know $T_{true}$.

2.1.1.  Derivation  of  the  Method

The point-to-surface distance in (1) is defined as the length of the shortest line joining a point and a surface. In general, there is no closed form analytical expression for this distance given an arbitrary surface; however, the following local approximation has been proposed [12]:

$$D(x) = \frac{F(x)}{\|\nabla F(x)\|} \qquad (3)$$

where $x$ is a point which may or may not lie on the surface, $F(x) = 0$ is the implicit equation of the surface, $\|\nabla F(x)\|$ is the magnitude of the gradient of $F$ at $x$, and $D(x)$ is the approximate distance. It can be shown that $D(x)$ is a first order approximation of the true point-to-surface distance, and is exact when the surface is a plane.

Given a point $x_s$ which lies on the surface, a small transformation, $T_\delta$, will perturb this point from its resting position. $T_\delta$ can be represented by a homogeneous transformation which is a function of the 6 parameter vector,

$$t = [\, t_x \;\; t_y \;\; t_z \;\; \omega_x \;\; \omega_y \;\; \omega_z \,]^T \qquad (4)$$

in which $(\omega_x, \omega_y, \omega_z)$ are rotations about the X, Y, and Z axes respectively, and $(t_x, t_y, t_z)$ are translations along the newly rotated X, Y, and Z axes. The gradient of $D$ with respect to $t$ is a 6-vector,

$$V(x_s) = \frac{\partial D(T_\delta(x_s))}{\partial t} = \begin{bmatrix} n \\ x_s \times n \end{bmatrix} \qquad (5)$$

where $n$ is the unit normal to the surface at the point $x_s$ [8]. This result can be extended to consider the effect of perturbing a collection of points with respect to the surface:

$$E_P(T_\delta(x_s)) = dt^T \left[ \sum_{x_s \in P} V(x_s)\, V^T(x_s) \right] dt = dt^T\, \Psi_P\, dt \qquad (6)$$

The scalar quantity $E_P(T_\delta(x_s))$ is a first order approximation of the least-squared error of (1). It measures the error which would result by perturbing a set of discrete points, $P$, initially assumed to be on the surface, by the small transformation $T_\delta$. The matrix $\Psi_P$ is a symmetric, positive semi-definite 6x6 scatter matrix which contains information about the distribution of the original $V(x_s)$ vectors over the points in the set $P$. Performing principal component analysis [4], $\Psi_P$ is transformed into an expression which is more easily interpreted:

$$E_P(T_\delta(x_s)) = dt^T Q \Lambda Q^T dt = \sum_{i=1}^{6} \lambda_i\, (dt^T q_i)^2 \qquad (7)$$

where $\Lambda = \mathrm{diag}[\lambda_1 \ldots \lambda_6]$ is a diagonal 6x6 matrix of the eigenvalues of $\Psi_P$ in which $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \lambda_4 \geq \lambda_5 \geq \lambda_6$; $Q$ is a 6x6 matrix whose columns are the eigenvectors of $\Psi_P$; and each $q_i$ is an eigenvector corresponding to the eigenvalue $\lambda_i$ which represents a differential transformation 6-vector. This result is similar to one presented in the context of industrial inspection [6].
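
A small sketch may help make the scatter matrix concrete. It assumes the parameter ordering translations first, rotations second, so that each constraint vector is the normal followed by $x_s \times n$; the function names are hypothetical and the example is not taken from the paper.

```python
def cross(a, b):
    """3-D cross product a x b."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def constraint_vector(x_s, n):
    """6-vector for one surface point: [n, x_s x n] (assumed ordering)."""
    return list(n) + cross(x_s, n)


def scatter_matrix(points, normals):
    """Sum of outer products V V^T over all points: the 6x6 scatter matrix."""
    psi = [[0.0] * 6 for _ in range(6)]
    for x_s, n in zip(points, normals):
        v = constraint_vector(x_s, n)
        for r in range(6):
            for c in range(6):
                psi[r][c] += v[r] * v[c]
    return psi


def energy(psi, dt):
    """Quadratic form dt^T Psi dt, the first order error for perturbation dt."""
    return sum(dt[r] * psi[r][c] * dt[c] for r in range(6) for c in range(6))
```

For four points on the plane z = 0 with normals along Z, a translation along X or a rotation about Z gives zero energy, while a translation along Z does not. Those zero-energy directions are exactly the unconstrained motions of a plane sliding on itself.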

From  (7)  it  can  be  seen  that  the  eigenvector  cor¬ 
responding  to  the  largest  eigenvalue,  represents  the 
direction  of  maximum  constraint.  Perturbing  the 
points  in  the  set  P  in  the  direction  of  q^  will  result 
in  the  largest  possible  change  in  Ep  from  among 
all  possible  directions  of  perturbation.  Similarly, 
the  differential  transformation  represented  by  the 
eigenvector  q^  corresponds  to  the  direction  of 
maximum  freedom.  Perturbing  the  points  in  this 
direction  will  result  in  the  smallest  possible  change 
in  £'p  from  among  all  possible  directions  of  per¬ 
turbation.  In  general,  an  eigenvalue  A,^.  is  propor¬ 
tional  to  the  rate  of  change  of  Ep  induced  by  a 
differential  transformation  in  the  direction  q^ . 

A special situation occurs when some of the $\lambda_i$ are close to or equal to zero. For each such eigenvalue, a singularity exists such that perturbing the points in the direction of the corresponding eigenvector will result in no change in $E_P$. Such singularities are undesirable in registration since it is impossible to localize the object in the direction(s) of the singularity(s). As demonstrated below, sets of discrete points, P, which have well-conditioned scatter matrices, $\Psi_P$, are preferable to sets which have ill-conditioned scatter matrices for achieving accurate registration. In this work, the noise amplification
index (NAI) [7] is used as a measure of matrix conditioning and is defined as

$$\mathrm{NAI} = \frac{\lambda_6}{\sqrt{\lambda_1}} \qquad (8)$$

This  quantity  is  the  product  of  the  inverse  condi¬ 
tion  number  and  the  square  root  of  the  minimum 
eigenvalue,  and  provides  an  upper  bound  on  the 
amplification  of  residual  errors  (e.g.,  discrete  point 
Data  measurement  noise,  and  errors  in  the  Model) 
to  the  estimated  parameters  (e.g.,  registration  trans¬ 
formation  parameters)  [7]. 
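
The constraint analysis above can be illustrated with a short sketch. It is not the authors' implementation, and it rests on two assumptions that are not confirmed by the text at hand: that each constraint vector has the form V = [n, x cross n] (surface normal and moment of the normal about the origin), and that the NAI is the minimum eigenvalue divided by the square root of the maximum eigenvalue. The Jacobi routine is only a stand-in for any symmetric eigenvalue solver, and all function names are hypothetical.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def scatter_matrix(points, normals):
    """6x6 scatter matrix: sum of V V^T with V = [n, x cross n] (assumed form)."""
    psi = [[0.0] * 6 for _ in range(6)]
    for x, n in zip(points, normals):
        v = list(n) + list(cross(x, n))
        for i in range(6):
            for j in range(6):
                psi[i][j] += v[i] * v[j]
    return psi

def symmetric_eigenvalues(matrix, max_sweeps=100, tol=1e-24):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] * a[i][j] for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-30:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # update columns p and q (A J)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q (J^T A J)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def noise_amplification_index(psi):
    """Assumed NAI: minimum eigenvalue over the square root of the maximum."""
    lam = symmetric_eigenvalues(psi)
    lam_max, lam_min = lam[0], max(lam[-1], 0.0)
    return lam_min / math.sqrt(lam_max) if lam_max > 0.0 else 0.0

# Four points on each of three faces of a cube centered at the origin.
cube_pts, cube_nrm = [], []
for a_ in (-0.5, 0.5):
    for b_ in (-0.5, 0.5):
        cube_pts += [(1.0, a_, b_), (a_, 1.0, b_), (a_, b_, 1.0)]
        cube_nrm += [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Points on a unit sphere: the normal equals the position, so x cross n = 0.
sphere_pts = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
              (0.6, 0.8, 0.0), (0.0, 0.6, 0.8), (0.8, 0.0, 0.6)]

cube_nai = noise_amplification_index(scatter_matrix(cube_pts, cube_nrm))
sphere_nai = noise_amplification_index(scatter_matrix(sphere_pts, sphere_pts))
```

Under these assumptions the sphere gives an NAI of zero: rotating a sphere about its center changes none of the point-to-surface distances, which is exactly the kind of singularity described above. The cube configuration constrains all six degrees of freedom and has a nonzero NAI.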

2.1.2.  Scale  and  Coordinate  System 
Dependences 

In  constraint  analysis,  there  is  an  implicit  weight¬ 
ing  factor  related  to  object  size  which  determines 
the relative importance of rotational versus translational errors. Due to the $\mathbf{x}_s$ term on the right hand side of (5), if constraint analysis is applied to two objects which differ only in size, the resulting NAI values will differ. The larger object will weight rotational components more heavily since the corresponding $\mathbf{x}_s$ terms will be larger. A solution to
this  problem  is  to  pre-normalize  the  Model  so  that 
the  average  radius  as  measured  about  the  origin  is 
unity.  This  has  the  effect  of  weighting  rotational 
and  translational  components  equally,  on  average. 
A  complete  discussion  of  the  scale  dependence 
problem  can  be  found  in  [8]. 

Constraint analysis has a dependence upon the location of the origin of the Model coordinate system arising from the $\mathbf{x}_s$ term on the right hand side of (5). For a given Model, it can be shown that
maximal  sensitivity  of  constraint  analysis  is 
achieved  when  the  constraint  analysis  coordinate 
system  origin  is  located  at  the  centroid  of  the 
Model  [8]. 
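
The two normalizations discussed in this section (origin at the Model centroid, average radius of unity) can be sketched as follows. This is an illustrative sketch only: it assumes the centroid and average radius are estimated from the Model vertices themselves, and the function name is hypothetical.

```python
import math

def normalize_model(vertices):
    """Translate vertices so their centroid is the origin, then scale so the
    mean distance from the origin is 1. Returns (vertices, centroid, scale)."""
    count = len(vertices)
    centroid = tuple(sum(v[k] for v in vertices) / count for k in range(3))
    centered = [tuple(v[k] - centroid[k] for k in range(3)) for v in vertices]
    mean_radius = sum(math.sqrt(sum(c * c for c in v)) for v in centered) / count
    scale = 1.0 / mean_radius
    normalized = [tuple(c * scale for c in v) for v in centered]
    return normalized, centroid, scale

# Corners of a cube with 100 mm edges, one corner at the origin.
corners = [(x, y, z) for x in (0.0, 100.0) for y in (0.0, 100.0) for z in (0.0, 100.0)]
unit_corners, centroid, scale = normalize_model(corners)
```

The returned centroid and scale are kept so that a registration result computed in the normalized frame can be mapped back to the original Model units.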

2.2.  Constraint  Synthesis 

The  goal  of  constraint  synthesis  is  to  automatically 
generate  Data  sets  which  maximize  the  NAI  for  a 
given  Model  and  a  fixed  number  of  points.  The 
resulting  data  collection  plan  (DCP)  can  then  be 
used  to  guide  the  acquisition  of  Data  during  the 
intra-operative  Data  collection  process.  More  for¬ 
mally,  the  constraint  synthesis  problem  is  to: 

Select  M  discrete  points  from  a  set,  V,  and  place 
them  into  the  set,  P,  of  (6)  such  that  the  NAI  is 
maximized. 
In this paper, the set V is composed of all vertices
of a given triangle mesh Model. In general, any
sufficiently dense sampling of a surface can be
used for V. If there are regions of the Model in
which  Data  cannot  be  collected  (e.g.,  because  of 
limited  access  during  surgery),  points  in  these 
regions  can  be  excluded  from  V.  The  number  of 
points,  M,  in  P  is  fixed  for  a  given  trial  of  con¬ 
straint  synthesis.  Finding  Data  configurations  of 
minimum  size  which  satisfy  registration  accuracy 
requirements  is  discussed  below. 

Constraint  synthesis  is  a  combinatorial  search 
problem,  and  for  all  but  artificially  small  problems 
the  solution  space  is  too  large  to  search  exhaus¬ 
tively.  A  search  algorithm  for  solving  constraint 
synthesis  which  combines  hillclimbing  and  a  non- 
deterministic  optimization  method  is  described 
below.  A  complete  description  of  constraint  syn¬ 
thesis  solution  methods  can  be  found  in  [8]. 

2.2.1.  Hybrid  PBIL  /  Hillclimbing  Search 
Algorithm 

In next ascent hillclimbing (NAH), M vertices are chosen from the set of possible vertices, V, and placed into the “selected” set P. Let NAI(P) represent the value of the NAI computed from (5) - (8) using the points in P. Randomly select a vertex, $v_p$, from P, and a vertex, $v_v$, from V. Substitute $v_p$ with $v_v$ in P, and compute the new value of NAI(P). If this substitution results in an improvement in the NAI, then implement the substitution and iterate the process. If the substitution does not improve the NAI, then recompute NAI(P) with new randomly selected vertices $v_p$ and $v_v$. Continue iterating until there are no additional substitutions which improve the NAI. The maximum number of NAI evaluations per iteration is N(M-1) where N and M are the number of vertices in the sets V and P respectively, although such a large number of evaluations is rarely reached in practice. For an average size problem (e.g., N ≈ 5000, M ≈ 50), NAH usually converges within 1000 iterations. The number of NAI evaluations is typically small during initial iterations, and increases during the later iterations when there are fewer possible substitutions which increase NAI(P).
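
The NAH procedure described above can be sketched as below. This is a minimal illustration rather than the authors' code: the objective is passed in as a generic `score` function standing in for NAI(P), candidate substitutions are tried in random order without repetition, and all names are hypothetical.

```python
import random

def next_ascent_hillclimb(candidates, m, score, start=None, rng=None):
    """Greedy swap search: accept the first substitution that improves the
    score, and stop when no single substitution improves it."""
    rng = rng if rng is not None else random.Random(0)
    candidates = list(candidates)
    selected = list(start) if start is not None else rng.sample(candidates, m)
    best = score(selected)
    improved = True
    while improved:
        improved = False
        chosen = set(selected)
        # Every (position in P, unselected vertex of V) pair, in random order.
        swaps = [(i, v) for i in range(len(selected))
                 for v in candidates if v not in chosen]
        rng.shuffle(swaps)
        for i, v in swaps:
            trial = selected[:i] + [v] + selected[i + 1:]
            value = score(trial)
            if value > best:  # next ascent: take the first improvement found
                selected, best, improved = trial, value, True
                break
    return selected, best

# Toy objective standing in for NAI(P): the sum of the selected values.
picked, value = next_ascent_hillclimb(range(20), 3, sum)
```

With the toy objective every local optimum is also the global one, so the search always ends at the three largest values. A real NAI objective has many local optima, which motivates the hybrid approach discussed next.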

In high-dimensionality optimization problems, hillclimbing methods such as NAH are susceptible to local minima in the search space. Genetic algorithms (GAs) are biologically motivated adaptive systems based upon principles of natural selection and genetic recombination which attempt to avoid local minima. A simplified model of the GA called Population-Based Incremental Learning (PBIL) was recently introduced [1]. For the purposes of this paper, PBIL can be thought of as a black-box with the following inputs: the set of allowable vertices, V; the number of points in the configuration set, P; a function which computes NAI(P) based upon the surface Model; and a stopping criterion. The output of PBIL is the particular configuration which maximizes the NAI among all configurations evaluated by PBIL within a given trial.
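
A black box with the inputs and output listed above might look like the following sketch. It is only one plausible reading: the text does not say how configurations are encoded, so the adaptation of PBIL to fixed-size subsets (weighted sampling without replacement), the fixed generation count used as the stopping criterion, and every name and default value here are assumptions.

```python
import random

def weighted_sample(pool, prob, m, rng):
    """Draw m distinct items, each draw weighted by the item's probability."""
    pool = list(pool)
    picked = []
    for _ in range(m):
        weights = [prob[v] for v in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        picked.append(pool.pop(idx))
    return picked

def pbil_select(candidates, m, score, generations=50, population=20,
                learning_rate=0.1, rng=None):
    """Return the best configuration evaluated, and its score."""
    rng = rng if rng is not None else random.Random(0)
    candidates = list(candidates)
    prob = {v: 0.5 for v in candidates}  # one probability per vertex
    best_cfg, best_score = None, float("-inf")
    for _ in range(generations):  # stopping criterion: fixed generation count
        gen_cfg, gen_score = None, float("-inf")
        for _ in range(population):
            cfg = weighted_sample(candidates, prob, m, rng)
            value = score(cfg)
            if value > gen_score:
                gen_cfg, gen_score = cfg, value
        # Shift the probability vector toward the best sample of this generation.
        winners = set(gen_cfg)
        for v in candidates:
            target = 1.0 if v in winners else 0.0
            prob[v] = (1.0 - learning_rate) * prob[v] + learning_rate * target
        if gen_score > best_score:
            best_cfg, best_score = gen_cfg, gen_score
    return best_cfg, best_score

# Toy objective standing in for NAI(P).
cfg, val = pbil_select(range(20), 3, sum)
```

The configuration returned by such a search is the natural starting point for a subsequent hillclimbing pass.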

While  PBIL  is  good  at  avoiding  local  minima  in 
the  constraint  synthesis  search  space,  the  resulting 
solutions  may  not  be  locally  optimal.  Likewise, 
hillclimbing  methods  are  good  at  ensuring  local 
optimality,  but  usually  don’t  converge  to  globally 
optimal  configurations.  By  combining  these  two 
approaches,  it  is  possible  to  take  advantage  of  the 
strengths  of  each.  In  the  hybrid  search  algorithm, 
PBIL  is  run,  followed  by  a  run  of  NAH  initialized 
at  the  configuration  found  by  PBIL. 

2.3.  Data  Configuration  Stability 

Data  collection  plans  (DCPs)  generated  by  con¬ 
straint  synthesis  can  be  used  to  guide  acquisition  of 
registration  data.  Since  the  precise  object  location 
is  unknown  before  registration,  it  is  impossible  to 
acquire  the  exact  points  specified  by  constraint 
synthesis.  Due  to  this  uncertainty  and  to  noise  in 
the  sensing  process,  the  effective  NAI  value  (i.e., 
computed  from  the  collected  Data)  may  be  smaller 
than  the  ideal  NAI  (i.e.,  computed  from  the  DCP). 
Certain  Data  configurations  are  more  stable  than 
others  (i.e.,  there  is  less  NAI  variation  as  the  points 
in  P  are  perturbed  about  the  DCP  positions). 
Attempts  to  incorporate  stability  criteria  into  the 
constraint  synthesis  process  have  resulted  in  expo¬ 
nential  complexity  [8].  Nevertheless,  improved  sta¬ 
bility  can  be  achieved  via  two  methods: 
navigational  guidance  during  Data  collection  and 
high  curvature  filtering. 

During  the  data  acquisition  process,  it  is  possible  to 
use  the  current  registration  transformation  estimate 
to  provide  navigational  guidance  to  the  human 
Data collector. Guidance is provided by displaying
a  3-D  graphical  rendering  of  the  registration  object 
and  overlaying  icons  representing  the  locations  of 
the  desired  point  and  the  sensor  tip.  The  sensor 
location  icon  is  dynamically  updated  in  real-time 
based  upon  measurements,  and  is  derived  from  the 
registration  transformation  estimate.  The  goal  of 
the  Data  collector  is  to  align  the  two  icons.  Each 
time  additional  Data  is  collected,  uncertainty  in  the 
collection  process  is  reduced  by  refining  the  regis¬ 
tration  transformation  estimate. 

The  primary  cause  of  Data  configuration  instability 
is  disparity  between  the  surface  normals  of  desired 
and  collected  Data  points.  Constraint  synthesis 
may  select  a  given  Data  point  because  its  surface 
normal  strongly  contributes  to  constraint  in  a  given 
direction  (see  (5)).  However,  if  the  Data  point  actu¬ 
ally  collected  has  a  significantly  different  surface 
normal,  the  resulting  NAI  value  may  be  reduced. 
This  effect  can  be  reduced  by  initially  focussing 
data  collection  in  regions  of  low  curvature  so  that 
surface  normals  of  the  collected  points  are  more 
likely  to  be  similar  to  those  of  the  desired  points. 
After  low  curvature  points  are  collected  and  collec¬ 
tion  uncertainty  is  reduced,  points  in  higher  curva¬ 
ture  regions  can  be  collected.  To  implement  this, 
several  DCPs  are  synthesized,  some  with  points  in 
regions  of  low  curvature  and  others  in  regions  of 
higher  curvature  [8].  The  resulting  DCPs  can  then 
be  used  to  guide  the  collection  process. 

3.  Results 

This  section  demonstrates  significant  improvement 
in  registration  accuracy  due  to  the  proposed  meth¬ 
ods.  Three  Models  are  used  in  the  reported  experi¬ 
ments:  a  cube  with  edge  length  of  100  mm,  a 
human  cadaveric  femur,  and  a  human  cadaveric 
pelvis. Models of the femur and pelvis with superimposed
Data collection plans are shown in Figure 1.

The registration error measure used to report results in this section is the maximum correspondence error (MCE) [8] [11]. The MCE is computed by transforming all vertices in a Model by $T_{err}$ of (2), computing distances between each transformed vertex and its un-transformed correspondence, and selecting the largest distance. The MCE specifies the largest single point displacement within a registration object resulting from $T_{err}$.
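
The MCE computation described above is small enough to sketch directly. The representation of the error transformation as a 3x3 rotation matrix plus a translation vector is an assumption made for this example, and the function names are hypothetical.

```python
import math

def apply_rigid(rotation, translation, p):
    """Apply p -> R p + t for a 3x3 rotation R and a 3-vector t."""
    return tuple(sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
                 for r in range(3))

def max_correspondence_error(vertices, rotation, translation):
    """Largest distance between a vertex and its image under the transform."""
    return max(math.dist(v, apply_rigid(rotation, translation, v))
               for v in vertices)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
model = [(1.0, 0.0, 0.0), (0.0, 0.0, 5.0)]

mce_shift = max_correspondence_error(model, identity, (3.0, 4.0, 0.0))
mce_turn = max_correspondence_error(model, quarter_turn_z, (0.0, 0.0, 0.0))
```

A pure translation moves every vertex by the same amount, whereas a rotation moves vertices far from the axis the most; the vertex on the rotation axis in the second example does not move at all.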

3.1.  Constraint  Analysis  Experiments 

Registration  trials  were  conducted  using  simulated 
Data  to  demonstrate  the  relation  between  registra¬ 
tion  error  and  NAI.  Data  points  were  generated  by 
applying  known  random  transformations  to  nomi¬ 
nal  Data  configurations  and  adding  zero-mean 
Gaussian noise. Since the true transformations, $T_{true}$, are known, the error transformations, $T_{err}$, can be computed. Figure 2 shows a plot of MCE vs. NAI for the three nominal cube configurations shown on the right of the figure. Configuration C1 contains 24 points per face, while C2 and C3 contain 4 points per face each. For each configuration,
the  mean,  standard  deviation,  minimum  and  maxi¬ 
mum  MCE  over  500  registration  trials  are  plotted. 
The  parameters  for  generating  noise  and  random 
transformations  are  shown  in  the  plot.  The  trend 
from  the  plot  is  clear:  configurations  with  larger 
values  of  NAI  result  in  smaller  registration  error.  In 
particular,  note  that  configuration  C2  has  smaller 
values  of  MCE  (and  a  larger  NAI)  than  C3,  despite 
having  the  same  number  of  points. 

Figure 3 demonstrates differences in noise sensitivity as a function of NAI for the cube configurations. The graphs show how MCE varies as a function of expected noise magnitude. For each datum, 500 registration trials were performed and the mean values for these trials are plotted. In the absence of noise, all three configurations perform equally well. As noise increases, configurations with smaller values of NAI are clearly more sensitive. This illustrates that the utility of intelligent data selection is dependent upon the magnitude of sensor noise (among other factors).

Figure 1: Surface Models of the femur and pelvis with overlaid DCPs.

Figure 2: MCE vs. NAI for 3 configurations on a cube.

TABLE 1. Pelvis synthesis results - NAI values (max / min).

Method        Configuration Size - M (25, 50, 75)
Random        0.41 / 0.00    1.18 / 0.08    1.52 / 0.33    1.68 / 0.53
NAH           1.42 / 1.28    2.62 / 2.43    3.97 / 3.76    4.90 / 4.79
PBIL          1.41 / 1.35    2.70 / 2.58    3.88 / 3.81    4.84 / 4.76
PBIL + NAH    1.52 / 1.36    2.75 / 2.65    4.02 / 3.92    4.94 / 4.90

3.2.  Constraint  Synthesis  Experiments 

Table  1  demonstrates  the  efficacy  of  the  constraint 
synthesis  search  algorithms  for  the  pelvis.  Data 
configurations were synthesized using 4 configuration sizes and 4 methods of generation. Five configurations were generated for each size-method combination, except for the random method for which 1000 configurations were generated. The
maximum  and  minimum  NAI  values  over  the  gen¬ 
erated  configurations  are  shown.  For  each  configu¬ 
ration  size,  the  hybrid  PBIL/NAH  method 
produced  the  best  results. 

Figure 4 compares 5 random and 5 synthesized configurations of size 25 for the pelvis in a plot of MCE versus NAI. For each configuration, a set of registration trials was performed using the indicated parameters. In this graph, the 5th and 95th percentiles of MCE are plotted instead of the minimum and maximum values. When generating the simulated registration Data, a second noise component was added which models the uncertainty associated with Data collection. This noise perturbs a point from its nominal location by a uniform random distance along the surface. For this experiment, the radius of uncertainty was 5.0 mm. From the graph, it is clear that the synthesized configurations are superior to the randomly generated ones in terms of both NAI and MCE.

Figure 3: MCE vs. expected noise magnitude for 3 cube configurations.

Figure 4: MCE vs. NAI for 5 random and 5 synthesized configurations on pelvis Model.

Figure  5  shows  similar  results  for  the  femur  Model 
using  5  random  and  5  synthesized  configurations 
of  size  10.  The  figure  demonstrates  the  effect  of 
high  curvature  filtering;  no  filtering  results  in 
unstable  Data  configurations  and  larger  errors. 

3.3.  In-vitro  Cadaver  Experiments 

We performed registration trials using Data collected from a cadaveric femur. For these experiments, estimation of $T_{true}$ is a challenging engineering problem which our group has solved using a highly accurate fiducial-based registration method [11]. Using a filtered version of the femur Model, DCPs of 6 and 50 points were synthesized, each a total of 5 times. The corresponding Data points were collected on the actual femur using a digitizing probe. Each synthesized configuration was independently collected 5 times. In addition, 50 manually-selected Data sets were collected for each configuration size. To guide the collection process, the navigational guidance mechanism described above was used. Initial values of $T_{est}$ were computed using manually selected anatomical landmarks and point-to-point registration [2].

Experimental  results  are  shown  in  Figure  6.  Each 
graph plots the MCE value resulting from registration versus the effective NAI value computed after registration using the closest Model points ($M_i$ of (1)) to solve for $\mathbf{n}_s$ and $\mathbf{x}_s$ of (5). From the graphs it
is  clear  that  the  synthesized  point  configurations 
are  superior  to  the  manually  selected  ones  for  both 
configuration  sizes.  Six  points  is  the  theoretical 
minimum  number  required  to  solve  the  shape- 
based  registration  problem  without  correspon¬ 
dence.  As  seen,  selecting  6  well-conditioned  Data 
points  is  a  difficult  task  for  humans.  Note  that  some 
synthesized  configurations  for  the  6-point  results 
have  small  NAI  values  and  large  MCE  values  due 
to  data  collection  uncertainty.  However,  using  the 
online  accuracy  evaluation  method  described 
below,  such  configurations  can  easily  be  identified 
and  additional  Data  can  be  collected  to  improve  the 
result. 

To be useful, an online accuracy estimate must relate a quantity which can be measured during the registration process to a second quantity which has physical meaning to the task for which registration is being performed. Figure 7 shows a plot of MCE versus RMS error (definition in the figure). It is shown in [8] that the slope of the line which relates worst case MCE to RMS error is independent of sensor noise, the number of Data points, and Data collection uncertainty, assuming that the effective NAI value is slightly greater than zero. Furthermore, it is shown that the slope of this line can be determined from simulated registration experiments such as those of Section 3.2. Therefore, during the registration process, online measurement of RMS error can be used to estimate an upper bound on MCE. This estimate can then be used to determine if accuracy requirements are satisfied, and additional Data collection can be requested if not.

By coupling online accuracy estimation with intelligent data selection, it is possible to collect minimally-sized Data sets which satisfy accuracy requirements. This is done by pre-synthesizing multiple NAI-optimal configurations of increasing size, each of which is a superset of the previous. During the collection process, a Data set is collected and registered, and accuracy is estimated. This process is continued until accuracy requirements are satisfied, or until all of the synthesized sets have been collected.

Figure 5: Femur; MCE vs. NAI - random and synthesized points. (Expected noise magnitude: 0.5 mm; number of points per configuration: 10; radius of collection uncertainty: 5.0 mm; registration trials: 500.)

Figure 6: MCE vs. NAI - physically collected Data on femur. Note scale differences.

4. Conclusions

The methods described in this paper show promise as tools for analyzing and maximizing accuracy in shape-based registration. Intelligent data selection is likely to be most useful when data collection is expensive and sensor noise is high. Online accuracy estimation is likely to be useful with and without intelligent data selection. Work is currently in progress to evaluate the practicality of these methods in clinical situations.

References

Abstract


In  this  paper,  we  present  a  robust  method 
for  creating  a  triangulated  surface  mesh 
from  multiple  range  images.  Our  method 
merges  a  set  of  range  images  into  a 
volumetric  implicit-surface  representation 
which  is  converted  to  a  surface  mesh  us¬ 
ing  a  variant  of  the  marching-cubes  algo¬ 
rithm.  Unlike  previous  techniques  based 
on  implicit-surface  representations,  our 
method  estimates  the  signed  distance  to 
the  object  surface  by  finding  a  consensus 
of  locally  coherent  observations  of  the  sur¬ 
face.  We  call  this  method  the  consensus- 
surface  algorithm.  This  algorithm  effec¬ 
tively  eliminates  many  of  the  troublesome 
effects  of  noise  and  extraneous  surface  ob¬ 
servations  without  sacrificing  the  accu¬ 
racy  of  the  resulting  surface.  We  utilize 
octrees  to  represent  volumetric  implicit 
surfaces — effectively  reducing  the  compu¬ 
tation  and  memory  requirements  of  the 
volumetric  representation  without  sacri¬ 
ficing  resolution  (and,  hence,  accuracy)  of 
the  volume  grid.  Our  results  demonstrate 
that  our  consensus-surface  algorithm  can 
construct  accurate  geometric  models  from 
rather  noisy  input  range  data  and  some¬ 
what  imperfect  alignment. 
1 Introduction

In  this  paper,  we  present  a  novel  approach  for  build¬ 
ing  a  3D  surface  model  from  a  number  of  range  im¬ 
ages  of  an  object.  The  goal  of  this  work  is  to  use  real 
images  of  an  object  to  automatically  create  a  model 
which  is: 

•  Geometrically  accurate:  depicts  the  correct  di¬ 
mensions  of  the  object  and  captures  small  de¬ 
tails  of  the  object  geometry 

•  Clean:  eliminates  noise  and  errors  in  the  views 

•  Complete:  models  the  surface  as  much  as  is 
observable  from  the  sample  views 

We  begin  by  reviewing  three  methods  which  are 
most  closely  related  to  our  work  and  follow  that  by 
a  brief  discussion  of  other  related  work.  The  first 
three  works  are  similar  to  our  algorithm  in  that  they 
all  make  use  of  volumetric,  implicit-surface  represen¬ 
tation  and  the  marching-cubes  algorithm  [Lorensen 
and  Cline,  1987]  to  merge  the  range-image  data  from 
several  views  into  a  surface  model.  The  main  differ¬ 
ences  between  these  algorithms  are  their  methods 
for  computing  the  signed  distance  from  each  voxel 
(volume  element)  to  the  closest  surface. 

Hoppe  et  al.  [Hoppe  et  al.,  1992]  were  the  first  to 
propose  constructing  3D  surface  models  by  applying 
the  marching-cubes  algorithm  [Lorensen  and  Cline, 
1987]  to  a  discrete,  implicit-surface  function  gener¬ 
ated  from  a  set  of  range  images.  Their  algorithm 
computes  the  signed  distance  function  from  a  set  of 
3D  points.  Much  of  their  algorithm  works  to  in¬ 
fer  local  surface  approximations  from  the  cloud  of 
points.  Nearest-neighbor  search  of  the  inferred  sur¬ 
faces  is  used  to  compute  the  signed  distance  from 
each  voxel  to  the  surface  of  the  point  set. 

Curless and Levoy [Curless and Levoy, 1996] followed Hoppe’s general scheme with a few significant departures. First, their method was geared towards using 3D data acquired from range images. Rather than performing a simple search for the closest point from a voxel’s center to determine the signed distance, they take a weighted average of the signed distances from the voxel center to range-image points whose image rays intersect the voxel, integrating the signed distance along these rays through the volume.

The method most similar to our work is that of Hilton et al. [Hilton et al., 1996]. As in our work and Curless and Levoy’s work, Hilton et al. generate a volumetric implicit-surface representation from a number of triangle sets obtained from range images. Similarly to our algorithm, Hilton’s method uses a nearest-neighbor search to find the surface points from each view which are closest to a given voxel’s center. Heuristics are used to determine which closest points to use in computing the signed distance.

The  main  limitation  of  the  above  algorithms  is  that 
they  do  not  compensate  for  noise  or  extraneous  point 
data— the  data  is  assumed  to  be  part  of  the  object 
and  noise  is  assumed  to  be  negligible.  In  addition, 
each  of  these  methods  can  suffer  from  inaccuracy 
due  to  their  integration  strategy  (for  more  details 
see  [Wheeler,  1996]). 

There have been several other approaches to this modeling problem; most notably, Soucy and Laurendeau [Soucy and Laurendeau, 1992] and Turk and Levoy [Turk and Levoy, 1994] presented methods for piecing together sets of triangulated surfaces.
Both  methods  marked  significant  advancements  in 
the  state  of  the  art,  but  often  perform  poorly  with 
respect  to  noise  and  alignment  errors  in  the  data. 

2  Approach 

The  problem  we  are  tackling  in  this  paper  is  to  build 
a  3D  model,  a  unified  surface  representation,  from  a 
number  of  range  images  of  an  object.  To  build  a  3D 
surface  model  from  multiple  range  images,  we  must 
address  the  following  problems: 

•  View  alignment:  To  merge  the  data,  the  data 
from  all  the  images  must  be  in  the  same  coor¬ 
dinate  system. 

•  Data  merging:  We  need  to  merge  all  the  image 
data  while  eliminating  or  greatly  reducing  the 
effects  of  noise  and  extraneous  data. 

Our  solution  makes  use  of  a  volumetric  represen¬ 
tation  to  avoid  difficulties  associated  with  topology. 

We will show how the volumetric representation simplifies our data-merging problem, virtually eliminating the topology issue. The volumetric representation can be conveniently converted into a triangulated mesh representation with little loss of geometric accuracy. The merging problem is then a
matter  of  converting  our  input  surface  data  to  the 
volumetric  representation. 

The  conversion  problem  is  exacerbated  by  the  fact 
that  input  surface  data  from  real  sensors  (e.g.,  range 
sensors or stereo) is noisy and, in fact, will contain
surfaces  that  are  not  part  of  the  object  we  are  inter¬ 
ested  in  modeling.  Our  method  for  merging  the  sur¬ 
faces  into  a  volumetric  representation  takes  full  con¬ 
sideration  of  these  facts  to  best  take  advantage  of  the 
multiple  observations  to  smooth  out  the  noise  and 
eliminate  undesired  surfaces  from  the  final  model. 

One  important  issue  which  we  do  not  address  in  this 
work  is  how  to  select  views  in  order  to  best  cover 
the  surface.  The  human  operator  determines  the 
number  of  views  and  the  object  orientation  for  each 
view. 

The  rest  of  this  paper  provides  the  details  of  our 
solutions  to  these  problems  which  combine  to  form 
a  practical  method  for  building  3D  surface  models 
from  range  images  of  an  object. 

3  View  Alignment 

After taking several range images of an object and
converting  them  to  surfaces  (described  in  [Wheeler, 
1996]),  we  need  to  eventually  merge  all  these  sur¬ 
faces  into  a  single  model.  The  problem  is  that  each 
view  is  taken  from  a  different  coordinate  system  with 
respect  to  the  object.  In  order  to  compare  or  match 
the  data  from  different  views,  we  must  transform 
all  the  data  into  the  same  coordinate  system  with 
respect  to  the  object. 
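
As a minimal sketch of this step, assume each view comes with a known rigid transform (a 3x3 rotation and a translation) that maps its sensor coordinates into the common object frame; the representation and the function names are assumptions made for illustration.

```python
def to_object_frame(points, rotation, translation):
    """Map sensor-frame points into the object frame: p -> R p + t."""
    return [tuple(sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
                  for r in range(3))
            for p in points]

def align_views(views):
    """views: iterable of (points, rotation, translation), one entry per view."""
    merged = []
    for points, rotation, translation in views:
        merged.extend(to_object_frame(points, rotation, translation))
    return merged

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
no_shift = (0.0, 0.0, 0.0)

# The same surface point seen in two views: directly, and after the object
# has been turned so that the point appears at (0, -1, 0) in the sensor frame.
merged = align_views([
    ([(1.0, 0.0, 0.0)], identity, no_shift),
    ([(0.0, -1.0, 0.0)], quarter_turn_z, no_shift),
])
```

After alignment both observations coincide at the same object-frame location, which is the precondition for comparing or merging data across views.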

There  are  several  ways  we  can  approach  this  align¬ 
ment  problem— each  requiring  varying  levels  of  hu¬ 
man  interaction:  manual  alignment,  semi-automatic 
alignment,  and  automatic  alignment.  As  view  align¬ 
ment  is  not  the  central  focus  of  this  work,  we  use 
controlled  motion  with  calibration  as  it  is  currently 
the  most  practical  option  for  an  automatic  solution. 
In  our  experimental  setup,  we  calibrate  two  axes  of 
a  Unimation  Puma  robot  with  respect  to  a  range¬ 
sensor  coordinate  frame.  We  can  then  mount  the  ob¬ 
ject  on  the  robot’s  end  effector  and  acquire  images 
of  an  object  at  arbitrary  orientations  with  known 
positions. 

From  this  point  we  assume  that  the  views  are 
aligned.  Next,  we  consider  the  problem  of  merg¬ 
ing  all  the  data  from  these  views  into  a  single  model 
of  the  object’s  surface. 

4  Data  Merging 

Given  a  number  of  triangle  sets  which  are  aligned 
with  respect  to  the  desired  coordinate  system,  we 
are  now  faced  with  the  task  of  taking  many  trian¬ 
gulated  surfaces  in  3D  space  and  converting  them 
to  a  triangle  patch  surface  model.  Even  if  we  are 
given  perfect  sets  of  triangulated  surfaces  from  each 
view  which  are  more  or  less  perfectly  aligned,  the 
merging  problem  is  difficult.  The  problem  is  that 
it  is  difficult  to  determine  how  to  connect  triangles 
from  different  surfaces  without  knowing  the  actual 
surface beforehand. There are an exponential number of ways to connect two triangulated surfaces together, some acceptable and some not acceptable.
This  problem  is  exacerbated  by  noise  in  the  data 
and  errors  in  the  alignment.  Not  only  does  the  de¬ 
termination  of  connectedness  become  more  difficult, 
but  now  the  algorithm  must  also  consider  how  to 
eliminate  the  noise  and  small  alignment  errors  from 
the resulting model. Recently, however, several researchers [Hoppe et al., 1992, Curless and Levoy, 1996, Hilton et al., 1996] have moved from trying
to  connect  together  surface  patches  from  different 
views  to  using  volumetric  methods  which  hide  the 
topological  problems — making  the  surface-merging 
problem  more  tractable.  In  the  next  section  we  dis¬ 
cuss  the  volumetric  method  which  we  use  to  solve 
the  surface-merging  problem. 

4.1  Volumetric  Modeling  and 
Marching  Cubes 

Recently,  the  marching-cubes  algorithm  [Lorensen 
and  Cline,  1987],  an  algorithm  developed  for  graph¬ 
ics  modeling  applications,  has  made  volumetric 
modeling  more  useful  by  virtually  eliminating  the 
blocky  nature  of  occupancy  grids.  The  representa¬ 
tion  used  by  the  marching-cubes  algorithm  is  slightly 
more  complicated  than  the  occupancy-grid  represen¬ 
tation.  Instead  of  storing  a  binary  value  in  each 
voxel  to  indicate  if  the  cube  is  empty  or  filled,  the 
marching-cubes  algorithm  requires  that  the  data  in 
the  volume  grid  are  samples  of  an  implicit  surface.  In 
each voxel, we store the signed distance, f(x), from the center point of the voxel, x, to the closest point on the object’s surface. The sign indicates whether the point is outside, f(x) > 0, or inside, f(x) < 0, the object’s surface, while f(x) = 0 indicates that x lies on the surface of the object.
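
The sign convention can be illustrated on a surface whose signed distance is known in closed form. The sketch below samples a sphere on a small voxel grid; the sphere, the grid layout, and the function names are illustrative assumptions, not part of the method described in this paper.

```python
import math

def sphere_signed_distance(p, center, radius):
    """Positive outside the sphere, negative inside, zero on the surface."""
    return math.dist(p, center) - radius

def sample_grid(f, origin, spacing, shape):
    """Evaluate f at the center of every voxel of a regular grid."""
    nx, ny, nz = shape
    return [[[f((origin[0] + (i + 0.5) * spacing,
                 origin[1] + (j + 0.5) * spacing,
                 origin[2] + (k + 0.5) * spacing))
              for k in range(nz)]
             for j in range(ny)]
            for i in range(nx)]

# A 4x4x4 grid of unit voxels covering [-2, 2]^3 around a unit sphere.
grid = sample_grid(lambda p: sphere_signed_distance(p, (0.0, 0.0, 0.0), 1.0),
                   origin=(-2.0, -2.0, -2.0), spacing=1.0, shape=(4, 4, 4))
```

Voxels near the grid corners hold positive values and the voxels around the center hold negative ones; the surface lies wherever neighboring voxels change sign.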

The marching-cubes algorithm constructs a surface
mesh by “marching” around the cubes while following
the zero crossings of the implicit surface
f(x) = 0. The location of the surface can be interpolated
by examining the signed distances of neighboring
voxels. Thus, the resulting surface will be relatively
smooth, and the surface can be located more
precisely than the spacing of the volume grid.
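
The interpolation step can be illustrated with a minimal Python sketch. It assumes linear interpolation of the signed distance along one voxel edge, which is the usual choice in marching-cubes implementations; the function name `zero_crossing` is hypothetical and not taken from the paper.

```python
def zero_crossing(x0, f0, x1, f1):
    """Estimate where f = 0 on the segment x0 -> x1 by linear interpolation.

    x0, x1 are 3-tuples (e.g. neighboring voxel centers); f0, f1 are the
    signed distances stored there and must not have the same sign.
    """
    if f0 == f1 or f0 * f1 > 0:
        raise ValueError("no sign change between the two samples")
    t = f0 / (f0 - f1)  # fraction of the way from x0 to x1
    return tuple(a + t * (b - a) for a, b in zip(x0, x1))


# One sample 0.25 outside the surface, its neighbor 0.75 inside:
print(zero_crossing((0.0, 0.0, 0.0), 0.25, (1.0, 0.0, 0.0), -0.75))
# (0.25, 0.0, 0.0)
```

The crossing lands a quarter of the way along the edge, which is why the recovered surface can be positioned more finely than the grid spacing.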

The  marching-cubes  algorithm  and  the  volumetric 
implicit-surface  representation  provide  an  attrac¬ 
tive  alternative  to  other  conceivable  mesh-merging 
schemes.  First,  they  eliminate  the  global  topology 
problem — how  the  various  surfaces  are  connected — 
for  merging  views.  The  representation  can  model 
objects  of  arbitrary  topology  as  long  as  the  grid  sam¬ 
pling  is  fine  enough  to  capture  the  topology.  Most 
importantly,  the  problem  of  creating  the  volumet¬ 
ric  representation  can  be  reduced  to  a  single,  simple 
question: 

What  is  the  signed  distance  between  a 
given  point  and  the  surface? 

The given point is typically the center of a voxel, but
we  don’t  really  care.  If  we  can  answer  the  question 
for  an  arbitrary  point,  then  we  can  use  that  same 
question  at  each  voxel  in  the  volume. 

Now we may focus on a more easily defined problem:
How do we compute f(x)? The real problem underlying
our simple question is that we do not have a
surface;  we  have  many  surfaces,  and  some  elements 
of  those  surfaces  do  not  belong  to  the  object  of  inter¬ 
est  but  rather  are  artifacts  of  the  image  acquisition 
process  or  background  surfaces.  In  the  next  subsec¬ 
tion,  we  present  an  algorithm  that  answers  the  ques¬ 
tion  and  does  so  reliably  in  spite  of  the  existence  of 
noisy  and  extraneous  surfaces  in  our  data. 

4.2  Consensus-Surface  Algorithm 

In this section, we will answer the question of how
to compute the signed distance function f(x) for arbitrary
points x when given N triangulated surface
patches from various views of the object surface. We
call  our  algorithm  the  consensus-surface  algorithm. 

We can break down the computation of f(x) into
two steps:

•  Compute the magnitude: compute the distance,
|f(x)|, to the nearest object surface
from x

•  Compute  the  sign:  determine  whether  the 
point  is  inside  or  outside  of  the  object 

We are given N triangle sets, one set for each range
image of our object, which are aligned in the same
coordinate system. The triangle sets are denoted
by $T_i$, where $i = 1, \ldots, N$. The union of all triangle
sets is denoted by $T = \bigcup_i T_i$. Each triangle set, $T_i$,
consists of some number $n_i$ of triangles, which are
denoted by $\tau_{i,j}$, where $j = 1, \ldots, n_i$.

If the input data were perfect (i.e., free of any noise
or alignment errors in the triangle sets from each
view), then we could apply the following naive algorithm,
Algorithm ClosestSignedDistance, to compute f(x):

Algorithm ClosestSignedDistance
Input: point x
Input: triangle set T
Output: the signed distance d

1.  $(p, n) \leftarrow \mathrm{ClosestSurface}(x, T)$
2.  $d \leftarrow \lVert x - p \rVert$
3.  if $(n \cdot (x - p) < 0)$
4.    then $d \leftarrow -d$
5.  return d

where Algorithm ClosestSurface returns the point,
p, and its normal, n, such that p is the closest point
to x from all points on triangles in the triangle set
T.

The naive algorithm for f(x) finds the nearest triangle
from all views and uses the distance to that
triangle as the magnitude of f(x). The normal of
the  triangle  can  be  used  to  determine  whether  x  is 
inside  or  outside  the  surface.  If  the  normal  of  the 
closest  surface  point  is  directed  towards  x,  then  x 
must  be  outside  the  object  surface. 
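
As a rough illustration of the naive algorithm, the following Python sketch replaces the triangle sets with lists of oriented sample points (point, normal) and uses a brute-force nearest-neighbor search. Both simplifications are assumptions made to keep the example short; the paper operates on triangles, and the function names are hypothetical.

```python
import math


def closest_surface(x, samples):
    """Return the (point, normal) sample closest to x (brute-force search)."""
    return min(samples, key=lambda s: math.dist(x, s[0]))


def closest_signed_distance(x, samples):
    """Naive signed distance: trust the single nearest observation."""
    p, n = closest_surface(x, samples)
    d = math.dist(x, p)
    # The normal points away from x when x lies inside the object.
    if sum(ni * (xi - pi) for ni, xi, pi in zip(n, x, p)) < 0:
        d = -d
    return d


# Toy surface: samples of the plane z = 0 with normals pointing up (+z).
plane = [((float(i), float(j), 0.0), (0.0, 0.0, 1.0))
         for i in range(-1, 2) for j in range(-1, 2)]

print(closest_signed_distance((0.0, 0.0, 2.0), plane))   # 2.0  (outside)
print(closest_signed_distance((0.0, 0.0, -1.0), plane))  # -1.0 (inside)
```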

Again,  the  naive  algorithm  will  work  for  perfect 
data.  However,  we  must  consider  what  happens 
when  we  try  this  idea  on  real  data.  The  first  ar¬ 
tifact  of  real  sensing  and  small  alignment  errors  is 
that  we  no  longer  have  a  single  surface,  but  several 
noisy  samples  of  a  surface  (see  Figure  4  in  the  results 
section  for  an  example  of  the  type  of  noise  that  may 
be  present).  We  are  now  faced  with  choices  on  how 
to  proceed.  Clearly,  choosing  the  nearest  triangle 
(as  in  Algorithm  ClosestSignedDistance)  will  give  a 
result  as  noisy  as  the  constituent  surface  data.  For 
example,  a  single  noisy  bump  from  one  view  can  re¬ 
sult  in  a  bump  on  the  final  model. 

Inconsistent  values  for  the  implicit  distances  will  ap¬ 
pear  when  a  voxel  center  is  on  or  near  a  surface, 
since  the  samples  will  be  randomly  scattered  about 
the  real  surface  location.  It  may  become  difficult  to 
differentiate  between  several  observations  of  a  single 
surface  or  a  thin  wall.  This  is  especially  a  problem 
if  the  noise  is  of  similar  scale  to  the  voxel  size. 

A  more  sinister  problem  for  the  naive  algorithm  ap¬ 
plied  to  real  images  is  the  existence  of  noise  and  ex¬ 
traneous  data.  For  example,  it  is  not  uncommon  to 
see  triangles  sticking  out  of  a  surface  or  other  trian¬ 
gles  that  do  not  belong  to  the  object.  This  can  occur 
due  to  sensor  noise,  quantization,  specularities  and 
other  possibly  systematic  problems  of  range  imag¬ 
ing.  This  makes  it  very  easy  to  infer  the  incorrect 
distance  and  more  critically  the  incorrect  sign,  which 
will  result  in  very  undesirable  artifacts  in  the  final 
surface. One badly oriented triangle can result in a
voxel which is assigned f(x) with the incorrect sign.
The result will be a hole in, or a spurious bump rising
out of, the surface produced by the marching-cubes algorithm.

Our  solution  to  these  problems  is  to  estimate  the 
surface  locally  by  averaging  the  observations  of  the 
same  surface.  The  trick  is  to  specify  a  method 
for  identifying  and  collecting  all  observations  of  the 
same  surface. 

Nearby  observations  are  compared  using  their  loca¬ 
tion  and  surface  normal.  If  the  location  and  normal 
are  within  a  predefined  error  tolerance  (determined 
empirically),  we  can  consider  them  to  be  observa¬ 
tions  of  the  same  surface.  Given  a  point  on  one  of 
the  observed  triangle  surfaces,  we  can  search  that  re¬ 
gion  of  3D  space  for  other  nearby  observations  from 
other  views  which  are  potentially  observations  of  the 
same  surface.  This  search  for  nearby  observations 
can be done efficiently using k-d trees [Friedman et
al., 1977].

If  an  insufficient  number  of  observations  of  a  given 
surface are found, then the observation can be discarded
as isolated/untrusted and the search can continue.
Thus, we are requiring a quorum of observations
before using them to build our model. The
quorum of observations can then be averaged to produce
a consensus surface. This process virtually
eliminates  the  problems  described  previously  (with 
respect  to  the  naive  algorithm). 

As an improvement over using an equally weighted
voting scheme, we can assign a confidence value $\omega$
to each input surface triangle, weighting the surface
points/triangles from a range image by the cosine
of the angle between the viewing direction and the
surface normal. This is simply computed by

$$\omega = v \cdot n$$

where v and n are the viewing direction (of the given
point in the range image) and normal, respectively,
of the given triangle. The consensus can now be
measured as a sum of confidence measures, and the
quorum is over this weighted sum. The rationale
is that two low-confidence observations should not
have the same impact on the result as two high-confidence
observations. We can now specify the
consensus-surface algorithm:

Algorithm ConsensusSignedDistance
Input: point x
Input: triangle set T
Output: the signed distance d

1.  $(p, n) \leftarrow \mathrm{ClosestConsensusSurface}(x, T)$
2.  $d \leftarrow \lVert x - p \rVert$
3.  if $(n \cdot (x - p) < 0)$
4.    then $d \leftarrow -d$
5.  return d

The only change from Algorithm ClosestSignedDistance
is that Algorithm ConsensusSignedDistance
computes  the  closest  consensus-surface  point  and  its 
normal  in  line  1.  The  algorithm  for  computing  the 
closest  consensus-surface  point  and  its  normal  is  as 
follows: 

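Algorithm ClosestConsensusSurface
Input: point x
Input: triangle sets $T_i$, $i = 1, \ldots, N$
Output: the point and normal vector pair (p, n)

1.  $O_{set} \leftarrow \emptyset$  (* $O_{set}$ is the set of non-consensus neighbors *)
2.  $C_{set} \leftarrow \emptyset$  (* $C_{set}$ is the set of consensus neighbors *)
3.  for each triangulated set $T_i$
4.    do $(p, n) \leftarrow \mathrm{ClosestSurface}(x, T_i)$
5.       $(p, n, \omega) \leftarrow \mathrm{ConsensusSurface}(p, n, T)$
6.       if $\omega \ge \theta_{quorum}$
7.         then $C_{set} \leftarrow C_{set} \cup (p, n, \omega)$
8.         else $O_{set} \leftarrow O_{set} \cup (p, n, \omega)$
9.  if $C_{set} \ne \emptyset$
10.   then $(p, n, \omega) \leftarrow \arg\min_{(p, n, \omega) \in C_{set}} \lVert x - p \rVert$
11.   else $(p, n, \omega) \leftarrow \arg\max_{(p, n, \omega) \in O_{set}} \omega$
12. return $(p, n, \omega)$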

Algorithm ClosestConsensusSurface examines the
closest point in each view and searches for its consensus
surface if one exists. After computing the
closest consensus surfaces for each view, it chooses
the closest of those from the consensus set $C_{set}$. $C_{set}$
contains those locally averaged surfaces whose observations’
confidence values sum to at least $\theta_{quorum}$.
Note that two consensus surfaces are not differentiated
based on their confidence sum $\omega$ but rather on
their proximity to x. If no consensus surface
exists, the algorithm selects the average surface
which has the highest summed confidence out of set
$O_{set}$.

For completeness, we outline Algorithm ConsensusSurface,
which is required by line 5 of Algorithm
ClosestConsensusSurface. Algorithm ConsensusSurface
finds all surface observations
that are sufficiently similar to the given point and
normal. These observations are then averaged to
generate a consensus surface for the input surface.
This algorithm relies on the predicate

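$$
\mathrm{SameSurface}\big((p_0, n_0), (p_1, n_1)\big) =
\begin{cases}
\text{True} & \text{if } (\lVert p_0 - p_1 \rVert \le \delta_d) \wedge (n_0 \cdot n_1 \ge \cos\theta_n) \\
\text{False} & \text{otherwise}
\end{cases}
\qquad (1)
$$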

which determines whether two surface observations
are sufficiently close in terms of location and normal
direction, where $\delta_d$ is the maximum allowed distance
and $\theta_n$ is the maximum allowed difference in normal
directions. Now we present the pseudocode for
Algorithm ConsensusSurface:

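Algorithm ConsensusSurface
Input: point x
Input: normal v
Input: triangle set $T = \bigcup_i T_i$
Output: the point, normal vector, and the sum of
the observations' confidences $(p, n, \omega)$

1.  $p \leftarrow n \leftarrow \omega \leftarrow 0$
2.  for $T_i \subset T$
3.    do $(p', n', \omega') \leftarrow \mathrm{ClosestSurface}(x, T_i)$
4.       if $\mathrm{SameSurface}((x, v), (p', n'))$
5.         then $p \leftarrow p + \omega' p'$
6.              $n \leftarrow n + \omega' n'$
7.              $\omega \leftarrow \omega + \omega'$
8.  $p \leftarrow p / \omega$
9.  $n \leftarrow n / \lVert n \rVert$
10. return $(p, n, \omega)$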

Note that in Algorithm ConsensusSurface, the definition
of Algorithm ClosestSurface was slightly modified
to also return the confidence $\omega'$ of the closest
surface triangle.
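
The three algorithms can be combined into one small Python sketch. It is only an illustration under stated assumptions: each view is a list of oriented, weighted sample points `(p, n, w)` instead of a triangle set, nearest neighbors are found by brute force rather than with k-d trees, the comparison operators in the SameSurface test are taken as non-strict, and all function and parameter names are hypothetical.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def closest_surface(x, view):
    """Closest (p, n, w) sample of one view to x (brute force)."""
    return min(view, key=lambda s: math.dist(x, s[0]))


def same_surface(p0, n0, p1, n1, delta_d, theta_n):
    """Location within delta_d and normals within angle theta_n (radians)."""
    return math.dist(p0, p1) <= delta_d and dot(n0, n1) >= math.cos(theta_n)


def consensus_surface(x, v, views, delta_d, theta_n):
    """Confidence-weighted average of all observations similar to (x, v)."""
    p_sum, n_sum, w_sum = [0.0] * 3, [0.0] * 3, 0.0
    for view in views:
        p, n, w = closest_surface(x, view)
        if same_surface(x, v, p, n, delta_d, theta_n):
            p_sum = [a + w * b for a, b in zip(p_sum, p)]
            n_sum = [a + w * b for a, b in zip(n_sum, n)]
            w_sum += w
    norm = math.sqrt(dot(n_sum, n_sum))
    if w_sum == 0.0 or norm == 0.0:
        return None  # nothing usable was collected
    return (tuple(a / w_sum for a in p_sum),
            tuple(a / norm for a in n_sum), w_sum)


def closest_consensus_surface(x, views, quorum, delta_d, theta_n):
    consensus, others = [], []
    for view in views:
        p, n, _ = closest_surface(x, view)
        result = consensus_surface(p, n, views, delta_d, theta_n)
        if result is not None:
            (consensus if result[2] >= quorum else others).append(result)
    if consensus:  # prefer the nearest surface that reached the quorum
        return min(consensus, key=lambda r: math.dist(x, r[0]))
    return max(others, key=lambda r: r[2], default=None)


def consensus_signed_distance(x, views, quorum, delta_d, theta_n):
    result = closest_consensus_surface(x, views, quorum, delta_d, theta_n)
    if result is None:
        raise ValueError("no surface observations available")
    p, n, _ = result
    d = math.dist(x, p)
    return -d if dot(n, [a - b for a, b in zip(x, p)]) < 0 else d


up, down = (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)
views = [
    [((0.0, 0.0, -0.1), up, 1.0)],   # three noisy observations of z = 0
    [((0.0, 0.0, 0.0), up, 1.0)],
    [((0.0, 0.0, 0.1), up, 1.0)],
    [((0.0, 0.0, 1.0), down, 1.0)],  # one isolated, badly oriented outlier
]
d = consensus_signed_distance((0.0, 0.0, 0.9), views, quorum=2.0,
                              delta_d=0.5, theta_n=math.radians(45))
print(round(d, 6))  # 0.9
```

In this toy input the outlier is the observation nearest to the query point, so the naive algorithm would report a distance of about 0.1. The outlier collects a summed confidence of only 1.0, below the quorum of 2.0, so the sketch falls back on the averaged surface at z = 0.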

We  refer  to  this  algorithm  as  a  whole  as  the 
consensus-surface  algorithm.  The  following  condi¬ 
tions  are  assumed: 

1.  Each part of the surface is covered by a number
of observations whose confidences add up
to more than $\theta_{quorum}$.

2.  No  set  of  false  surfaces  with  a  sufficient  summed 
confidence  will  coincidentally  be  found  to  be 
similar  (following  the  definition  of  Equation  1) 
or  this  occurrence  is  sufficiently  unlikely. 

3.  Given N surface views, the real surface is closest
to x in at least one view.

If  these  assumptions  are  violated,  mistakes  in  the 
surface mesh will result. From our experiments, a
quorum requirement, $\theta_{quorum}$, of 1.5 to 3.0 is usually
sufficient to eliminate errors given a reasonable
number  of  views  of  the  object. 

4.3  Accuracy  and  Efficiency 

Volumetric modeling involves a tradeoff between accuracy
and efficiency. To achieve the desired accuracy
we must use a dense sampling of the volume. Since
the memory requirement of a volume grid is cubic
in the sampling density, accuracy is usually the
first thing to be sacrificed. With the straightforward
use of volume grids, resources are wasted by computing
signed distances in parts of the volume that are
distant from the surface. For our purposes, the only
voxels that we need to examine are those near the
surface, a small fraction of the entire volume grid.
Curless and Levoy [Curless and Levoy, 1996] alleviate
this problem by run-length encoding each 2D slice of
the volume.

The  octree  representation  [Meagher,  1980]  solves 
both  the  accuracy  and  the  efficiency  problems  while 
keeping  the  algorithm  implementation  simple.  In¬ 
stead  of  iterating  over  all  elements  of  the  voxel 
grid,  we  can  apply  a  recursive  algorithm  on  an  oc¬ 
tree  that  samples  more  finely  in  octants  only  when 
necessary, that is, near the surface of the object. In
practice, the octree reduces the $O(n^3)$ storage and
computation requirement to $O(n^2)$, since the surfaces of
3D objects are, in general, 2D manifolds in a 3D
space.
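
The recursive refinement can be sketched in Python as follows. The subdivision test used here (refine a cube only when the magnitude of the signed distance at its center does not exceed the cube's half-diagonal) is an assumption chosen for illustration; it is valid when f is a true distance function. The function `build_octree` and the sphere example are hypothetical.

```python
import math


def build_octree(f, center, half, depth, leaves):
    """Collect finest-level cells that may contain the surface f = 0.

    f      : signed distance function taking a 3-tuple
    center : center of the current cube
    half   : half the edge length of the current cube
    depth  : remaining subdivision levels
    leaves : output list of (center, half, signed distance)
    """
    d = f(center)
    if abs(d) > half * math.sqrt(3.0):
        return  # the surface is farther away than the cube's corners
    if depth == 0:
        leaves.append((center, half, d))
        return
    h = half / 2.0
    for dx in (-h, h):
        for dy in (-h, h):
            for dz in (-h, h):
                child = (center[0] + dx, center[1] + dy, center[2] + dz)
                build_octree(f, child, h, depth - 1, leaves)


def sphere(q):
    """Signed distance to a sphere of radius 0.5 centered at the origin."""
    return math.sqrt(sum(c * c for c in q)) - 0.5


leaves = []
build_octree(sphere, (0.0, 0.0, 0.0), 1.0, 5, leaves)
print(len(leaves), "of", 32 ** 3, "finest-level cells were evaluated")
```

Only cells close to the sphere survive to the finest level, so the signed distance is never evaluated at fine resolution deep inside or far outside the object.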

4.4  Cost  of  the  Consensus-Surface 
Algorithm 

To get a rough estimate of the cost of our model-building
algorithm, let us assume that there are
N views being merged and that for each view the
triangle set $T_i$ has n triangles. Algorithm ConsensusSurface
computes the closest surface for each
view, which on average will be an $O(N \log n)$ operation
assuming k-d trees [Friedman et al., 1977] are
used. Algorithm ClosestConsensusSurface computes
the closest surface and then the respective consensus
surface for each view, which adds up to a cost
of $O(N^2 \log n)$. Assuming that an M x M x M
voxel grid is used, the modeling algorithm will cost
$O(M^3 N^2 \log n)$. However, if octrees are used we may
loosely assume that the number of voxels or octree
elements which are evaluated will be proportional to
the surface area of the object, reducing the average
cost to $O(M^2 N^2 \log n)$.
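
The estimate can be laid out step by step. The block below only restates the reasoning of this section, assuming that a single closest-surface query on one view costs $O(\log n)$ on average with a k-d tree and that the octree evaluates on the order of $M^2$ cells.

```latex
\begin{align*}
\text{ClosestSurface, one view} &: O(\log n) \\
\text{ConsensusSurface, } N \text{ views} &: N \cdot O(\log n) = O(N \log n) \\
\text{ClosestConsensusSurface} &: N \cdot \bigl( O(\log n) + O(N \log n) \bigr) = O(N^2 \log n) \\
\text{full } M \times M \times M \text{ grid} &: M^3 \cdot O(N^2 \log n) = O(M^3 N^2 \log n) \\
\text{octree, about } M^2 \text{ cells near the surface} &: M^2 \cdot O(N^2 \log n) = O(M^2 N^2 \log n)
\end{align*}
```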

5  Experimental  Results 

Here  we  present  some  experimental  results  of  our  im¬ 
plementation  of  the  3D  object  modeling  algorithm 
described  in  this  paper.  For  our  experiments,  we  se¬ 
lected  a  variety  of  objects  to  model  using  our  system. 
We  assume  that  the  objects  are  rigid  and  opaque.  In 
this  paper,  we  show  the  results  obtained  for  model¬ 
ing  a  toy  duck.  See  [Wheeler,  1996]  for  a  complete 
description  of  our  experiments. 

For  the  example  presented  here  we  used  48  range 
images,  each  containing  256  x  240  pixels  with  each 
pixel  containing  a  3D  coordinate.  The  resolution  of 
data  is  approximately  1  mm,  and  the  accuracy  is  on 
the  order  of  0.5  mm. 

Figures  1-2  show  the  results,  including  an  intensity 
image  of  the  object,  a  close-up  of  some  of  the  tri¬ 
angulated  range  images  used  as  input  to  our  algo¬ 
rithm  (shaded  to  better  indicate  the  roughness  of 
the  original  data),  a  slice  of  the  volume  grid  where 
the  grey-scale  indicates  the  proximity  to  a  surface 
point  (black  closest,  white  furthest),  and  three  views 
of  the  resulting  triangulated  model.  For  this  ex¬ 
ample,  the  input  images  contained  555,000  trian¬ 
gles  and  the  resulting  model  contained  27,000.  The 
finest  resolution  of  the  voxel  grid  was  1.8  mm  and 
approximately  4%  of  the  128  x  128  x  128  volume  grid 
was expanded by the octree. The parameters used were
$\theta_{quorum} = 2.25$, $\delta_d = 3$, and $\theta_n = 45^\circ$. Computing
time was 52 minutes on an SGI Indy 5 workstation
(a 124 MIPS/49.0 MFLOPS machine).

As an example of what the naive algorithm, Algorithm
ClosestSignedDistance of Section 4.2, would
produce, Figure 3 shows its result on the duck
data set.
Notice  how  many  extraneous  surfaces  exist  near  the 
duck  from  the  input  range-image  data.  The  naive 
algorithm  fails  because  it  trusts  that  every  surface 
observation  is  an  accurate  observation  of  the  object 
surface.  As  can  be  seen  from  the  sample  range  data 
of  the  duck  in  Figure  1,  this  is  not  the  case. 

To illustrate the accuracy of our modeling algorithm
more clearly, Figure 4 shows a cross section of
the final model and the original input range-image
data.  This  example  demonstrates  the  ability  of  our 
consensus-surface  algorithm  to  accurately  locate  the 
surface  in  very  noisy  data. 

6  Conclusion 

We  have  described  a  method  to  create  a  triangulated 
surface mesh from N range images. Robotic calibration
is used to acquire images of the object under
known  transformations,  allowing  us  to  align  the  im¬ 
ages  into  one  coordinate  frame  reliably.  Our  method 
for  data  merging  takes  advantage  of  the  volumet¬ 
ric  implicit-surface  representation  and  the  marching- 
cubes  algorithm  to  eliminate  topological  problems. 

The  main  contribution  of  this  paper  is  our  algo¬ 
rithm  for  merging  data  from  multiple  views:  the 
consensus-surface  algorithm  which  attempts  to  an¬ 
swer  the  question 

What  is  the  closest  surface  to  a  given 
point? 

With the answer to this question, we can easily compute
the signed distance f(x) correctly. While other
known  methods  also  implicitly  address  this  question, 
their  algorithms  do  not  capture  the  essence  of  the 
problem  and  produce  answers  by  taking  averages  of 
possibly  unrelated  observations.  In  contrast,  our  al¬ 
gorithm  attempts  to  justify  the  selection  of  observa¬ 
tions  used  to  produce  the  average  by  finding  a  quo¬ 
rum  or  consensus  of  locally  coherent  observations. 
This  process  eliminates  many  of  the  troublesome  ef¬ 
fects  of  noise  and  extraneous  surface  observations  in 
our  data.  Consensus  surfaces  can  be  computed  in¬ 
dependently  for  any  point  in  the  volume.  This  fea¬ 
ture  makes  it  very  easy  to  parallelize  and  allows  us 
to  straightforwardly  use  the  octree  representation. 
The  octree  representation  enables  us  to  model  ob¬ 
jects  with  high  accuracy  with  greatly  reduced  com¬ 
putation  and  memory  requirements. 

We  have  presented  the  results  of  our  modeling  al¬ 
gorithm  on  a  number  of  example  problems.  These 
results  demonstrate  that  our  consensus-surface  algo¬ 
rithm  can  construct  accurate  geometric  models  from 
rather  noisy  input  range  data  and  somewhat  imper¬ 
fect  alignment. 

Abstract

This paper describes a method for constructing models
from range data that combines both surface and
volumetric modeling techniques. Each of a series of
range images from different viewpoints is modeled by
a mesh surface, which is then extruded to form a solid
representing the volume of space occluded from the
sensor from that particular viewpoint. Set intersection
is used to integrate the information from different
views. Results for two highly convex parts are shown.
An important benefit of this method is that models can
be created from a small number of views and that the
modeling is done in continuous volumetric space.

1.  Introduction 

The  use  of  range  data  to  drive  the  modeling  process 
has been steadily increasing over recent years. This is
due not only to an increase in the availability and a
decrease in price of range imaging devices, but also
to an increase in the scale of use. Range data is
now  acquired  in  domains  such  as  cartography  and 
medicine,  and  is  used  in  manufacturing  processes 
such  as  inspection  and  in  computer  graphics  as  a 
method  of  acquiring  models  of  real-world  objects.  To 
be  useful  these  systems  must  be  able  to  merge  data 
acquired  from  multiple  viewpoints  to  complete  the 
modeling  task.  This  paper  describes  a  system  that 
performs  this  task,  incrementally  building  solid  mod¬ 
els  from  multiple  range  images.  It  motivates  the  inte¬ 
gration  of  both  mesh  surface  and  volumetric  models 
and  the  generation  of  a  topologically  correct  3-D 
model  at  each  stage  of  the  modeling  process,  allowing 
the use of well-defined geometric algorithms to perform
the merging task. One of the key benefits of
using  a  correct  solid  model  at  each  stage  in  the  pro¬ 
cess  is  that  we  can  guarantee  a  water-tight  model 
without  holes,  which  is  important  for  rapid  prototyp¬ 
ing  and  reverse  engineering  tasks.  These  solid  models 
can  also  be  used  to  automatically  plan  the  next  view 
during  the  model  acquisition  stage.  We  have  been 
able  to  use  our  previous  work  in  sensor  planning  for 
inspection  tasks  to  plan  the  next  view  of  an  object  to 
be  modeled,  thus  reducing  the  number  of  scans 
needed  to  recreate  a  correct  model  [9]. 

2.  Relation  to  Previous  Work 

Recent  research  on  the  acquisition,  modeling  and 
merging  process  includes  the  REFAB  system,  which 
allows  a  user  to  specify  approximate  locations  of 
machining  features  on  a  range  image  of  a  part;  the 
system  then  produces  a  best  fit  to  the  data  using  previ¬ 
ously-identified  features  and  domain-specific  knowl¬ 
edge  as  constraints  [12].  The  IVIS  system  uses  an 
octree  to  represent  the  seen  and  unseen  parts  of  each 
of  a  set  of  range  images  and  uses  set-theoretic  opera¬ 
tors  to  merge  the  octrees  into  a  final  model  [11].  Very 
recently  octrees  have  again  been  used  to  represent 
range  data,  in  combination  with  an  isosurface  extrac¬ 
tion  algorithm  that  produces  very  high  resolution 
results,  albeit  from  a  large  number  of  images  [4]. 
Methods  that  use  a  mesh  surface  to  model  and  inte¬ 
grate  each  of  a  set  of  range  images  [14]  [10]  [8]  or  to 
model  a  single,  complete  point  sampling  [5]  have  also 
proven  useful  in  this  task. 

The  popularity  of  mesh-based  methods  is  due  to  their 
combination  of  simplicity  and  representational  flexi¬ 
bility.  Free-form  parts  may  be  modeled  accurately 
without the need for multiple parametric shape representations,
and they may easily be incorporated into a
CAD/CAM  system.  In  addition,  they  lend  themselves 
well  to  incremental  methods  that  are  required  in  situa¬ 
tions  that  utilize  model  refinement  or  planning  stages. 

Methods that use surface models based on meshes,
such  as  [14],  however,  suffer  from  two  important 
problems.  First,  meshes  by  their  very  nature  tend  to 
introduce  holes  which  are  very  difficult  to  remove. 
Creation  of  water-tight  solids  is  an  important  end- 
product  of  these  modeling  systems  and  mesh-based 
methods  typically  require  significant  post-processing 
to  remove  these  holes.  Second,  when  surface  meshes 
are  joined  the  intersections  become  difficult  to  com¬ 
pute  and  can  introduce  error,  particularly  at  places 
where  the  meshes  overlap  but  do  not  actually  inter¬ 
sect.  An  interesting  way  to  alleviate  these  problems 
was  developed  in  [4],  which  uses  each  surface  mesh 
in  a  ray-casting  process  to  weight  vertices  in  an 
octree, followed by iso-surface extraction to reconstruct
a closed solid. In essence, this imposes a regular
volumetric structure on the irregular surface mesh.
The benefit of the volumetric structure is that it
allows identification of holes as well as methods to
fill them. While the results of this system are quite
good, there are still some issues that need to be
addressed, one of which is how the system behaves
with a smaller number of scans from widely varying
viewpoints. Our method differs in that we represent
the volume in a continuous space, which removes the
need for a discrete voxel structure. We construct a solid
for  each  range  image  that  represents  the  imaged  sur¬ 
face  and  the  space  this  surface  occludes  from  the  sen¬ 
sor.  Because  of  this  it  is  not  necessary  to  explicitly 
represent  “empty”  space,  and  it  is  not  necessary  to 
consider  separate  operations  to  fill  holes  in  the  model. 

The  method  uses  a  mesh  to  model  the  sensed  surface 
of  an  object,  and  then  sweeps  the  mesh  in  the  imaging 
direction  to  generate  a  solid  representation.  In  this 
regard  it  may  be  thought  of  as  an  integration  of  both 
mesh-based  reconstruction  techniques  and  work  that 
performs  edge  detection  and  projection  from  intensity 
images  [3]  [7].  The  model  created  from  each  imaging 
operation  is  a  solid  topologically-correct  CAD  model. 
A  benefit  of  this  is  that  it  allows  coarse  models  to  be 
created  from  a  small  number  of  scans  that  may  be 
acceptable  to  some  tasks,  for  example  3-D  FAX.  The 
method  is  an  incremental  one  that  allows  new  infor¬ 
mation  to  be  easily  integrated  as  it  is  acquired  into  a 
composite  model.  As  more  scans  are  merged,  the 
model’s  fidelity  increases.  Finally,  each  model  cre¬ 
ated  by  our  method  includes  information  about  the 
volume of occlusion, which describes the space
occluded from the sensor by the object, and is not
present  in  systems  that  only  model  sensed  surfaces. 
The  occlusion  volume  is  a  key  component  of  many 
sensor  planning  methods  because  it  allows  the  system 
to  reason  about  what  has  not  been  seen,  but  it  has  not 
yet  been  integrated  sufficiently  into  mesh-based 
methods. 

3.  Building a Solid from a Range Image

A  robotic  system,  comprised  of  a  Servo-Robot  laser 
rangefinder  attached  to  an  IBM  SCARA  robot,  is 
used  to  acquire  a  range  image  of  the  object  being 
modeled,  which  is  placed  on  a  motorized  rotation 
stage  (see  Figure  1).  The  result  of  each  scanning  pro¬ 
cess  is  an  NxM  range  image  from  a  particular  view¬ 
point,  the  direction  of  which  is  controlled  by  turntable 
rotation.  The  3D  point  data  from  the  range  image  are 
then  used  as  vertices  in  a  mesh  surface  model. 


Figure  1.  Experimental  setup  showing  robot  with 
attached  laser  rangefinder  (to  right)  and  turntable  (to 
left). 

3.1  Sweeping  a  Mesh  to  Extrude  a  Solid  Model 

Incremental  modeling  techniques  that  use  a  mesh  sur¬ 
face  model  can  not  form  a  closed  solid  until  a  set  of 
images  that  completely  cover  the  surface  of  the  object 
has  been  acquired.  The  primary  disadvantage  of  this 
is  that  it  prevents  the  use  of  a  planning  method  or  any 
other  procedure  that  requires  a  solid  model.  The  sec¬ 
ond  disadvantage  is  that  topological  information 
present  in  each  image  is  not  retained  since  all  inter¬ 
mediate  modeling  steps  are  represented  by  non-mani¬ 
fold  surfaces.  A  solution  to  this  problem  is  to 
“sweep”  the  mesh  surface  to  extrude  a  solid  model  of 
both the imaged object surfaces and the occluded volume.
This method may be stated concisely as:

$$S = \bigcup_{\forall i} \mathrm{extrude}(M_i)$$

Each triangular mesh element $M_i$ is swept orthographically
along the vector of the rangefinder’s sensing
axis until it comes in contact with a far bounding
plane, resulting in the 5-sided solid of a triangular
prism (Figure 2). A union operation is then applied to


Figure 2. Example of a mesh sweep operation.
(Left to right) Mesh surface, mesh surface with one
element swept, and mesh surface with all
elements swept and unioned. The sensing
direction is from the left.

the entire set of prisms, which produces a polyhedral
solid consisting of three sets of surfaces: a mesh-like
surface from the acquired range data, a number of lateral
faces equal to the number of vertices on the
boundary of the mesh derived from the sweeping
operation, and a planar bounding surface that caps
one end.
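
The sweep of a single mesh element can be sketched in Python. The sketch assumes that the far bounding plane is perpendicular to the sensing axis and is given by its offset along that axis; the function `extrude_triangle` and its face-index layout are hypothetical, and the union of the resulting prisms (left to a CAD kernel in the text) is not shown.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def extrude_triangle(tri, axis, far_offset):
    """Sweep a triangle along the unit vector `axis` to a far plane.

    The far plane is the set of points q with dot(axis, q) == far_offset.
    Returns (vertices, faces): six vertices and five faces given as tuples
    of vertex indices (two triangular caps and three lateral quads).
    """
    far = []
    for v in tri:
        step = far_offset - dot(axis, v)  # distance left to the far plane
        far.append(tuple(vk + step * ak for vk, ak in zip(v, axis)))
    vertices = [tuple(v) for v in tri] + far
    faces = [
        (0, 1, 2),     # sensed triangle
        (3, 4, 5),     # cap on the far bounding plane
        (0, 1, 4, 3),  # lateral faces, one per triangle edge
        (1, 2, 5, 4),
        (2, 0, 3, 5),
    ]
    return vertices, faces


tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
vertices, faces = extrude_triangle(tri, (0.0, 0.0, 1.0), 10.0)
print(vertices[3:])  # [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 1.0, 10.0)]
print(len(faces))    # 5
```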

This  method  is  particularly  simple  to  implement 
because  it  uses  two  operations  that  are  available  in 
virtually  every  3D  CAD  package:  linear  sweep  of  a 
triangle  with  no  draft  angle,  and  regularized  intersec¬ 
tion  of  axis-aligned  and  adjacent  prismatic  solids. 

4.  Merging Single-View Models

Each  successive  sensing  operation  will  result  in  new 
information  that  must  be  merged  with  the  current 
model being built. Merging of mesh-based surface
models has been done using clipping and re-triangulation
methods. These methods are necessary because
the meshes are not closed, so specialized techniques
that operate on non-manifold surfaces of
approximately continuous vertex density are
needed. In our method we generate a solid from
each  viewpoint  which  allows  us  to  use  a  merging 
method  based  on  set  intersection.  Many  CAD  systems 
include highly robust algorithms for set operations on
solids, and our algorithm takes advantage of this. This
is  of  critical  importance  in  this  application  for  the  fol¬ 
lowing  reasons:  the  high  density  of  the  range  images 
(and  therefore  the  small  size  of  many  of  the  mesh  ele¬ 
ments),  the  many  long  and  thin  lateral  surfaces,  and 
most  importantly  the  fact  that  many  of  these  models 
will  have  overlapping  surfaces  that  are  extremely 
close  to  each  other. 

The merging process itself starts by initializing the
sensed “composite” model to be the entire bounded
space of our modeling system. The information determined
by a newly acquired model from a single viewpoint
is incorporated into the composite model by
performing a regularized set intersection operation
between the two.
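
The merging loop can be illustrated with a minimal sketch. It assumes a voxel occupancy set as a stand-in for the CAD solids and a plain set intersection as a stand-in for the regularized Boolean operation; the names `full_workspace` and `merge_views` and the two example views are hypothetical and are not part of the system described here.

```python
from itertools import product


def full_workspace(n):
    """Every voxel index in an n x n x n bounded workspace."""
    return set(product(range(n), repeat=3))


def merge_views(workspace, view_solids):
    """Intersect the composite model with each single-view solid in turn."""
    composite = set(workspace)      # start from the entire bounded space
    for solid in view_solids:
        composite &= solid          # stand-in for regularized intersection
    return composite


# Two hypothetical single-view solids. Each keeps every voxel that the
# view could not rule out (sensed surface plus occluded volume behind it).
space = full_workspace(4)
view_a = {v for v in space if v[0] >= 1}
view_b = {v for v in space if v[1] <= 2}
model = merge_views(space, [view_a, view_b])
```

Because the composite only ever shrinks, a voxel removed by one view cannot be restored by any later view. This is the property that makes inaccurate single-view solids costly.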

5.  Experimental  Results 

To  show  the  behavior  of  this  system,  the  reconstrac- 
tion  of  two  parts  is  shown.  For  both  parts,  a  small 
number  of  range  images  were  taken,  using  turntable 
rotation  to  change  the  sensor  viewpoint.  The  solids 
constructed  from  these  images  are  shown,  along  with 
the  final  model  constructed  from  merging  of  those 
solids.  The  final  models  would  then  be  prime  candi¬ 
dates  for  a  decimation  algortihm,  such  as  the  one  pre¬ 
sented  in  [2]. 

The first part is the plastic hand controller (shown in
Figure 3) for a home video game machine. It consists
of polygonal and curved surfaces at varying levels of
detail, including various buttons on its front surface.
Three range images were taken of this part and swept
into the solids shown in Figure 4. The result of the
intersection is shown at the bottom of Figure 4. This
model has been used as input to a rapid prototyping
machine, and the model has been physically built.

Figure 3. Photograph of the video game controller.

Figure 4. Top: solid models constructed from each of
three range images of the game controller, each 120
degrees apart. Bottom: the final model.

The second model is that of a strut-like part, which
consists of curved and polygonal surfaces, large
self-occlusions, and two through-hole features, as shown
in Figure 5. The model for this part was constructed
with the assistance of our sensor planning system to
plan views for hand-designated targets [9]. In this
case, four images were taken at irregular angular
separations; the resulting models are shown in Figure 6.
The final model is shown in Figure 7, and captures the
outer surface and one of the two through-holes well.

Figure 5. Photograph of the strut part.

6.  Discussion 

Although this method is simple and effectively builds
many parts using only a few sensing operations, there
are two issues that make the reconstruction process
difficult. The first problem is that of determining
effective next viewpoints, which is by no means a
problem specific to our modeling method. The second
problem is due to the behavior of set intersection on
models built from sampled data, and here it is necessary
to pay more attention to this detail than is
required in other methods.

To effectively determine the next viewpoint for a
sensing operation, it is necessary to combine a planning
component with the modeling system. The planning
component we are integrating into our modeling
system [9] is based on previous work on the sensor
planning problem in our laboratory [1][13]. The sensor
planning problem is that of computing a set of
sensor locations for viewing a target given a model of
an object or scene, a sensor model, and a set of sensing
constraints. The planner is used to reason about
occlusion to compute valid, occlusion-free viewpoints
given a specific surface on the model. Our system
labels each surface as a sensed model surface or
one caused by the boundary with the occluded volume.
Using this information, we can plan sensor positions
for the next view that reduce the occluded
volume but do not suffer from model self-occlusion.
Once an unoccluded sensor position for the specified
surface has been determined, it may then be sensed,
modeled, and integrated with the composite model.
Thus, the method is target-driven and performed in
continuous space. As the incremental modeling process
proceeds, regions that require additional sensing
are guaranteed an occlusion-free view
from the sensor if one exists. Other viewing constraints
may also be included in the sensor planning,
such as sensor field of view, resolution, and standoff
distance.

Figure 6. Four models of the strut part, with
uneven rotations between each view.

The second problem described above is more specific
to the use of set intersection as a tool for volumetric
integration. In all systems it is assumed that the
object’s surface is “well behaved” with respect to the
distance between sample points. What this means is
that for the object’s surfaces that are visible to the
sensor and that are not highly inclined, the deviation
from the mesh element between adjacent samples is
small. Although the sampling interval depends on the
distance to the target, these values are typically about
2-3 mm for a perpendicular surface. However, there may
be surfaces that are highly inclined to the sensor,
which results in large distances between samples, and
therefore the possibility of a large deviation of the
object surface from the mesh element between samples.
Mesh-based methods that use only a surface
model handle this problem by disregarding these data,
with the assumption that the surface there will be
acquired in a later imaging operation from another
viewpoint. In our method, however, this can have a
very detrimental effect on the final model.

Figure 7. The final model of the strut.

To see why this is so, consider the 2-D example in
Figure 8, which represents one scan line from the
rangefinder. Due to the subsampling of the scene, an
inappropriate mesh surface (in 2-D, this is a line segment)
will be created that passes through the interior
of the object’s true geometry. In modeling systems
such as ours that rely on set intersection as a means of
integration, this is unacceptable because once a set of
points has been classified as “outside” the model,
there is no way to recapture that information no matter
how many subsequent images correctly include
those points.

Figure 8. 2-D example of typical surface error caused
by constructing mesh from sampled data points.

One possible solution to this problem would be to dilate
the model at those surfaces where the problem
occurs. It is clear from Figure 8 that the inappropriate
surfaces occur at extremal boundaries in the scene,
which correspond to mesh elements for which the
surface normal and sensing direction vectors are
nearly perpendicular. It should be possible then to
dilate the mesh at these surfaces to counteract the
effects of subsampling. Because these surfaces will
always require more sensing in any case, this dilation
does not affect any sensing strategy. We are currently
exploring this and volume tolerancing as solutions to
this problem.
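
As an illustration of how such mesh elements could be identified, the sketch below flags a triangle when the cosine of the angle between its normal and the sensing direction is close to zero. The function names and the threshold `min_abs_cos` are assumptions chosen for the example, and triangles are assumed to be non-degenerate. The dilation step itself is not shown.

```python
import math


def triangle_normal(a, b, c):
    """Unit normal of triangle (a, b, c) from the cross product of two edges."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    length = math.sqrt(sum(x * x for x in n))
    return [x / length for x in n]


def is_near_extremal(tri, view_dir, min_abs_cos=0.2):
    """True when the normal is nearly perpendicular to the sensing direction."""
    n = triangle_normal(*tri)
    d_len = math.sqrt(sum(x * x for x in view_dir))
    cos_angle = sum(n[i] * view_dir[i] for i in range(3)) / d_len
    return abs(cos_angle) < min_abs_cos


view = (0.0, 0.0, -1.0)                          # sensor looks along -z
facing = ((0, 0, 0), (1, 0, 0), (0, 1, 0))       # faces the sensor
grazing = ((0, 0, 0), (1, 0, 0), (0, 0.05, 1))   # seen almost edge-on
```

Only the `grazing` triangle is flagged here, so a dilation restricted to flagged elements would leave well-sampled surfaces untouched.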

7.  Conclusions 

We have described a system that builds a 3-D CAD
model of an object incrementally from multiple range
images. It motivates the generation of a solid model at
each stage of the modeling process, allowing the use
of well-defined geometric algorithms to perform the
merging and integration task. We have been able to
create models of a number of different objects using
just three or four scans. As we increase the number of
scans, the model fidelity will also increase. A benefit
of our method is that the volumes created can be used
in conjunction with our previous work in sensor planning
to plan the next view. Finally, we are refining
this system to improve the fidelity of the models and to
alleviate problems caused by using set intersection as
a tool for volumetric integration.

Abstract

Reverse engineering techniques can be used to aid
in the manufacture of spare parts when original parts
inventories are exhausted. For mechanical parts,
this process involves sensing the geometry of an existing
part, creating a geometric model of the part
from the sensed data, and then passing this model
to an appropriate CAD/CAM system for manufacturing.
Sensing errors and the modeling demands of
modern CAD systems present serious challenges to
the reverse engineering process. We describe constraints
that can be used to simplify the process and
aid in the construction of higher quality models than
would otherwise be possible.

1  Introduction 

The DOD must maintain a large quantity of legacy
hardware, much of it decades old. Part inventories
are frequently exhausted well before decommissioning
of the relevant pieces of equipment,
and additional spares are often difficult or impossible
to obtain from the original suppliers of the
equipment. Reverse engineering techniques can
be used to create CAD models of a part based on
sensed data acquired using three-dimensional position
digitization techniques, allowing the manufacture
of new spare parts based on an analysis of existing
parts, without the need for original CAD models
or other documentation [Traband et al., 1996]. This
process involves organizing and editing sensed 3-D
positions and then fitting surfaces to these points in
a manner such that the resulting geometric models
can be imported into a CAD system. High precision
in the modeling process is often required if the
final re-engineered parts are to be usable. The sensors
used to gather data are subject to a wide variety
of random and systematic errors, adding to the difficulty
of constructing precise models.

Modeling accuracy depends on effective use of
properties that distinguish the geometry of interest
from effects due to sensing errors. In the case of
geometric modeling of man-made objects, it is often
critical to recognize that certain shapes are more
likely than others and to use this information to reduce
the inherent degrees of freedom of the modeling
primitives used to fit the data (Figure 1).


Figure  1:  Constraints  reduce  degrees-of-freedom 
and  increase  modeling  accuracy. 

Three important classes of constraints aid the reverse
engineering of mechanical parts [Thompson
and Henderson, 1996]. Domain-specific primitives
reconstruct part geometry using the same modeling
primitives available to the original part designer
[Owen et al., 1994, Thompson et al., 1995,
Thompson et al., 1996]. Domain-specific pragmatics
capture common design practice. Functional
constraints capture aspects of the likely intent of the
design. Operationally, we exploit these constraints
by utilizing three types of information in our model
creation process: geometric primitives that are capable
of describing object geometry with minimal
parameters, local constraints which when possible
use a restricted set of sub-primitives to represent geometry,
and global constraints based on how different
primitives interact in a properly designed object.

Figure 2: Steps in the model construction process.

Figure 2 shows the various steps involved in the reverse
engineering process that we have developed.
A user provides a rough description of the manufacturing
features making up the part, which is used
to create an initial geometric model. This model is
used to segment sensor data into position points corresponding
to each feature. Modeling primitives
are fit to the segmented data. The fitting and segmenting
process can be iterated to refine the models.
Finally, inter-feature constraints are used to adjust
the individual feature descriptions to better approximate
a globally optimal description of the part.

2 Fitting Per-Feature Models

To demonstrate the effectiveness of feature-based
reverse engineering, we have created a prototype
system called REFAB (Reverse Engineering -
FeAture-Based). In this section, we outline the processes
involved in fitting profile pockets, the most
complex feature currently handled by the system, to
sensed data. A profile pocket is a 2½-D feature consisting
of a subtractive volume defined by an arbitrary
closed planar contour which is swept through
a specified distance along a perpendicular axis.

2.1  Data  Segmentation 

Most segmentation methods applicable to 3-D point
data involve bottom-up processing, where data is
grouped based on planar surfaces for polyhedra,
or on orientation discontinuities for curved surfaces
[Besl and Jain, 1988, Suk and Bhandarkar,
1992]. Since few machined parts are naturally represented
as polyhedral models and methods involving
orientation discontinuities are problematic with
noisy data, we utilize a more robust top-down segmentation
approach.

The segmentation of the 3-D points corresponding
to a particular profile pocket starts with the user
sketching the profile contour and indicating points
on the object at the bottom of the pocket and on
the surface into which the pocket is cut. Planes
are fit to the selected bottom and top points using
the least median of squares (LMedS) technique
[Rousseeuw and Leroy, 1987]. Next, planes are
fit to all data points consistently close to these initial
estimates. The estimates allow the use of an
efficient trimmed distribution least-squares method
without a significant increase in the sensitivity to
outliers [Thompson et al., 1995]. The resulting
planes serve two purposes: they define the sweep
extent of the pocket and they provide an estimate for
the axis along which the profile contour is swept.
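
The following is a minimal sketch of a least median of squares plane fit by random sampling, given only to illustrate the idea behind the initial robust estimate. It is not the REFAB implementation: the function names, the number of trials, and the fixed seed are assumptions, and the subsequent trimmed least-squares refinement is not shown.

```python
import random
import statistics


def plane_from_points(a, b, c):
    """Return (unit normal n, offset d) with n . x = d, or None if collinear."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    length = sum(x * x for x in n) ** 0.5
    if length < 1e-12:
        return None
    n = [x / length for x in n]
    return n, sum(n[i] * a[i] for i in range(3))


def lmeds_plane(points, trials=200, seed=0):
    """Keep the sampled plane whose median squared residual is smallest."""
    rng = random.Random(seed)
    best, best_med = None, float("inf")
    for _ in range(trials):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        med = statistics.median(
            (sum(n[i] * p[i] for i in range(3)) - d) ** 2 for p in points)
        if med < best_med:
            best, best_med = plane, med
    return best, best_med


# Hypothetical data: 20 points on the plane z = 1 plus 5 gross outliers.
inliers = [(float(x), float(y), 1.0) for x in range(5) for y in range(4)]
outliers = [(0.0, 0.0, 5.0), (1.0, 2.0, 7.0), (3.0, 1.0, -4.0),
            (2.0, 2.0, 9.0), (4.0, 0.0, 6.0)]
(normal, offset), med = lmeds_plane(inliers + outliers)
```

Because the score is a median rather than a sum, up to roughly half of the points can be outliers without pulling the estimate away from the dominant plane.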

Once the orientation of the pocket is known, the user
sketch of the pocket profile can be projected onto a
2-D plane. User-specified points are used to construct
an initial Bezier curve approximation to the
profile side [Schneider, 1990]. This in turn forms
the starting point for the segmentation of 3-D position
points corresponding to the sides of the profile.
The full point cloud is examined for points satisfying
three criteria:

• Distance check. Selected points must be close
to the profile curve along a direction perpendicular
to the sweep axis of the pocket.

• Bounding plane check. Selected points must
be within the bounding planes specifying the
sweep extent of the pocket.

• Normal direction check. Estimated surface
normals for selected points must be approximately
perpendicular to the sweep axis of the
pocket and close to the orientation of the nearest
point on the Bezier curve.


Selected  points  likely  to  correspond  to  the  profile 
side  are  used  to  fit  the  profile.  If  necessary,  the  new 
profile  can  be  used  to  produce  a  better  segmentation, 
iterating  the  process. 
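
The three checks can be sketched as a single filter. The sketch makes several simplifying assumptions that are not taken from the system itself: points are already expressed in a pocket frame whose z axis is the sweep axis, the profile is a 2-D polyline standing in for the Bezier curve, normals are unit length, and all tolerance names and values are hypothetical.

```python
import math


def closest_on_segment(p, a, b):
    """Closest point to p on the 2-D segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return (a[0] + t * dx, a[1] + t * dy)


def select_side_points(points, normals, profile, z_bottom, z_top,
                       dist_tol=0.5, axis_tol=0.2, normal_tol=0.9):
    """Return the points that pass all three segmentation checks."""
    selected = []
    for p, n in zip(points, normals):
        # Bounding plane check: inside the sweep extent along z.
        if not (z_bottom <= p[2] <= z_top):
            continue
        # Distance check: measured in the xy plane, perpendicular to z.
        best = None
        for a, b in zip(profile, profile[1:]):
            q = closest_on_segment(p, a, b)
            d = math.hypot(p[0] - q[0], p[1] - q[1])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > dist_tol:
            continue
        # Normal direction check: perpendicular to z, aligned with the curve.
        if abs(n[2]) > axis_tol:
            continue
        ex, ey = b[0] - a[0], b[1] - a[1]
        e_len = math.hypot(ex, ey)
        cnx, cny = -ey / e_len, ex / e_len
        align = abs(n[0] * cnx + n[1] * cny) / math.hypot(n[0], n[1])
        if align >= normal_tol:
            selected.append(p)
    return selected


profile = [(0.0, 0.0), (10.0, 0.0)]
points = [(5.0, 0.1, 1.0), (5.0, 3.0, 1.0), (5.0, 0.1, 9.0),
          (5.0, 0.1, 1.0), (5.0, -0.2, 1.5)]
normals = [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (0.0, 1.0, 0.0),
           (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
kept = select_side_points(points, normals, profile, 0.0, 2.0)
```

In this example only the first point survives; the others fail the distance, bounding plane, and normal direction checks respectively (the last two fail different parts of the normal check).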

2.2  Feature  Parameterization  and  Fitting 

The previous steps result in a geometric model specified
in terms of splines and bounding planes. This
is as far as most reverse engineering systems proceed.
Further simplifications are possible, however,
by recognizing that it is common design practice to
specify profiles in terms of lines and arcs rather than
totally free-form curves. We exploit this pragmatic
by converting the Bezier curve representation of a
profile to a line/arc formalism whenever this adequately
approximates the data. The result is a model
that is more likely to fit the original design, though
the data fitting errors will be larger.

Our current implementation allows for line segments
interspersed with one, two, or three consecutive
constant radius arcs. The entire profile is presumed
to be G1 (tangent) continuous. Triple arcs
are limited to a special reinforcing feature (boss)
which is symmetric. Because we have a good estimate
of the sweep direction of the feature, all analysis
can be done in 2-D by projecting segmented
points corresponding to the side of the pocket into a
plane perpendicular to the sweep direction.

Figure 3 outlines the process. First, the data points
are ordered by projecting them onto the spline approximation
of the profile. Then, standard polyline
approximation algorithms can be used to represent
the data as a sequence of line segments [Jain et al.,
1995]. Long segments in the polyline are presumed
to correspond to line segments in the original design.
To improve the accuracy with which these
line segments are described, they are re-fit using a
trimmed distribution least-squares method similar
to that used for plane fitting.

1. Construct polyline approximation to profile.

2. Refit long line segments to data.

3. Identify corners.

4. Identify single, double, and triple arcs.

5. Fit remaining data with splines.

Figure 3: Contour Fitting

Non-contact position digitizers smooth data near
sharp corners. As a result, the existence of corners
must be inferred when the polyline has two nearly
adjacent long line segments with distinctly different
orientations. In such cases, the line segments are
extended to form a true corner. The resulting polyline
consists of long segments corresponding to true
straight lines in the original profile and sequences of
shorter line segments corresponding to curves.
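
Extending two long segments to a true corner amounts to intersecting the infinite lines through them. A minimal 2-D sketch follows; the function name and the parallel tolerance are assumptions made for the example.

```python
def extend_to_corner(seg1, seg2, parallel_tol=1e-9):
    """Intersect the infinite lines through two 2-D segments.

    Each segment is ((x_start, y_start), (x_end, y_end)). Returns the
    corner point, or None when the lines are (nearly) parallel.
    """
    (x1, y1), (x2, y2) = seg1
    (x3, y3), (x4, y4) = seg2
    d1x, d1y = x2 - x1, y2 - y1
    d2x, d2y = x4 - x3, y4 - y3
    denom = d1x * d2y - d1y * d2x          # cross product of directions
    if abs(denom) < parallel_tol:
        return None
    t = ((x3 - x1) * d2y - (y3 - y1) * d2x) / denom
    return (x1 + t * d1x, y1 + t * d1y)


# A horizontal and a vertical segment whose ends were rounded off.
corner = extend_to_corner(((0.0, 0.0), (4.0, 0.0)), ((5.0, 1.0), (5.0, 6.0)))
```

Here the two segments stop short of each other, and the recovered corner lies at (5, 0), beyond the end of the first segment and before the start of the second.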

An attempt is made to fit each of these curve sections
with sub-features corresponding to a single,
double, or triple arc. The dimensionality of the
curve fitting can be substantially reduced by taking
advantage of tangent continuity. A generic single
arc has five degrees of freedom: the center point,
two angles, and a radius. Tangent continuity reduces
this to one, a radius. A double arc can be
modeled using only four parameters. The symmetric
triple arc has only three degrees of freedom. If
the data cannot be adequately approximated with
any of these sub-features, it is assumed to be a
free-form curve and is approximated with a spline.
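
To see why tangent continuity leaves a single arc with only its radius free, consider an arc joining two straight segments that meet at a corner. Given the two tangent lines, the radius fixes the center and both tangent points. The sketch below uses the standard fillet construction; the function name is hypothetical, and the two lines are assumed not to be collinear.

```python
import math


def fillet_arc(corner, dir1, dir2, radius):
    """Arc of a given radius tangent to two lines meeting at `corner`.

    dir1 and dir2 point away from the corner along each line. Returns
    (center, tangent_point_1, tangent_point_2).
    """
    def unit(v):
        length = math.hypot(v[0], v[1])
        return (v[0] / length, v[1] / length)

    u1, u2 = unit(dir1), unit(dir2)
    cos_t = max(-1.0, min(1.0, u1[0] * u2[0] + u1[1] * u2[1]))
    half = math.acos(cos_t) / 2.0               # half the corner angle
    bis = unit((u1[0] + u2[0], u1[1] + u2[1]))  # bisector direction
    to_center = radius / math.sin(half)
    to_tangent = radius / math.tan(half)
    center = (corner[0] + to_center * bis[0], corner[1] + to_center * bis[1])
    t1 = (corner[0] + to_tangent * u1[0], corner[1] + to_tangent * u1[1])
    t2 = (corner[0] + to_tangent * u2[0], corner[1] + to_tangent * u2[1])
    return center, t1, t2


# Right-angle corner at the origin with a unit-radius arc.
center, t1, t2 = fillet_arc((0.0, 0.0), (1.0, 0.0), (0.0, 1.0), 1.0)
```

For the right-angle corner above, the center is at approximately (1, 1) and the tangent points at (1, 0) and (0, 1), so the radius is the only remaining unknown when fitting.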

2.3  Example:  Fitting  a  Triple  Arc 
Sub-Feature 

In the case of the symmetric triple arc, we use a
parameterization specifying the center of the first
arc and the offset of the center of the middle arc
from a line connecting the centers of the two outside
arcs (Figure 4.a). Together with the tangent continuity
constraints deriving from the adjacent line
segments, this fully specifies the configuration. Fitting
arc sub-features requires non-linear optimization.
Both performance and the quality of the final
result depend on the availability of a good initial
guess. For triple arcs, the line segments which
form inflections in the polyline are found first
(Figure 4.c). L2 is produced by building a line segment
between p1 (the end of tangent line T1) and p2 (the
midpoint of the first inflection segment). The center
of arc one is hypothesized by intersecting the perpendicular
bisector of L2 and the line perpendicular
to T1 at p1. The center of arc two is generated by
intersecting the line L3 with the line of symmetry.
The final arc is symmetric to the first.

Figure 4: Triple arc sub-feature parameterization,
initialization, and fitting.
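
The initial guess for the center of arc one has a closed form: it is the center of the circle that is tangent to T1 at p1 and passes through p2. A minimal sketch follows, with a hypothetical function name and with T1 represented by a direction vector.

```python
def arc_center_guess(p1, t1_dir, p2):
    """Center of the circle tangent to direction t1_dir at p1 through p2.

    This intersects the line perpendicular to T1 at p1 with the
    perpendicular bisector of the segment p1-p2. Returns None when p2
    lies on the tangent line, in which case no finite circle exists.
    """
    nx, ny = -t1_dir[1], t1_dir[0]          # normal to the tangent line
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    denom = 2.0 * (nx * dx + ny * dy)
    if abs(denom) < 1e-12:
        return None
    s = (dx * dx + dy * dy) / denom         # signed distance along the normal
    return (p1[0] + s * nx, p1[1] + s * ny)


guess = arc_center_guess((0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
```

The value of `s` follows from requiring the center to be equidistant from p1 and p2, so the result does not depend on the length of `t1_dir`.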

These guesses are used to instantiate the initial parameterization
for the triple arc fitting process. This
fitting process is a non-linear optimizer which perturbs
the parameters until no further perturbations
reduce the total error from the data to the feature.
Figure 4.d shows the result of the final fit. Similar
techniques are used for the other arc features.

3  Optimizing  Constrained  Feature-Based 
Models 

Reconstructing models of free-form geometrical
surfaces is a difficult process, requiring the instantiation
of large numbers of parameters. Fitting
real-world data complicates this process by introducing
problems of occlusion and noise. By asserting functional
constraints over the geometry, the number of
parameters can be greatly reduced and the susceptibility
to noise lowered. In addition, the resulting
model can be made to conform to preconceived
ideas about standard 2½-D geometry used in manufacturing.

In much the same way as an understanding of the
manufacturing process suggests a feature-based approach
to reverse engineering, additional understanding
of the manufacturing and design processes
suggests that features are not encountered in isolation,
and thus certain geometric properties should
hold over features. We believe that an understanding
of the design process can yield large benefits
in the ability to create accurate models. Feature-based
models provide modeling primitives and can
enforce low-level constraints over these primitives,
reducing the number of variables necessary to represent
an object. Constraint-based techniques apply
high-level constraints over these features, further reducing
the DOFs as well as enforcing hypothesized
design intent.

The geometry found in most manufactured parts is
based on simple and consistent rules [Hsu, 1996].
Such parts are usually designed in the most straightforward
way, using common design practices. This
motivates the idea that apparent relations between
features are likely to be by design and not by coincidence.
Some modern design systems are starting
to enforce this idea by restricting the designer to a
set of design features (hierarchical structures related
to design intent). We utilize this information by enforcing
constraints during the reverse engineering
process. For example, if two planar surfaces on a
part appear parallel, then logic dictates that the designer
intended them to be parallel and they are fit
as such.

Consider the feature-based model shown in Figure 5.
An engineer, given a part milled from this
model, would no doubt recognize a simple 2-D
profile, two weight reduction pockets, and several
through holes. Further inspection might lead the engineer
to hypothesize certain geometric constraints
that are likely to hold for this type of part: the
part seems symmetric, the larger holes seem to be
the same size (two seem to be concentric), the inner
pocket seems to be aligned with the outer profile,
etc. Figure 5 displays these and other constraints.
By enforcing these properties during model
creation, the final result is more likely to accurately
represent the intended geometry of the part.

Figure 5: Constraints on Lower Link Model

The advantages of constraint-based model fitting are
threefold. First, the resultant model is more likely
to capture the design intent of the part. Second,
the number of degrees of freedom necessary to fit
the part can be significantly reduced, increasing the
speed and accuracy of the fitting process and reducing
the sensitivity to noise. Finally, features on the
part which are hardest to sense or perhaps contain
the greatest degradation can often be constrained to
other features which correspond to more valid data.
We call this process accuracy transference.

3.1  The  Constrained  Optimization  Process 

Modeling accuracy can be increased by recognizing
constraints between feature primitives. The constrained
optimization process is initialized with a
feature-based model (Figure 6). The following three
steps are then applied. First, a hypothesis step attempts
to semi-automatically identify possible constraints
on the model. Second, a new model is
formed which reduces the number of parameters
necessary to represent the object. During this stage,
all constraints which are not geometrically represented
are expressed by penalty functions. Finally,
an optimization process, utilizing a weak search algorithm,
is used to adjust the parameters in order
to produce the best model which fits the data while
obeying the constraints.

Figure 7: Supported Constraints

3.1.1  Constraint  Hypothesis 

An understanding of the design process suggests
certain practices which can be directly described
by geometric constraints. We list the 2½-D constraints
we have identified and implemented in Figure 7.
Radius and width equivalence apply to distance
measures. Tangent equivalence dictates that
intersecting sub-features, such as lines and arcs, are
G1 continuous. Identical but offset features represent
geometrically identical features which are
transformed into separate frames. Parallelism and
perpendicularity are applied both to 2-D line segments
and to 3-D planar surfaces. The other constraints
have standard geometric interpretations.

Currently a semi-automatic constraint finding algorithm
is used. The algorithm hypothesizes possible
constraints based on parametric equivalence. For
instance, consider the 2-D projection of two holes
onto the same anchor plane. If the centers of these
two holes are located within some small distance,
ε1, from each other, then there is high likelihood that
the two holes are concentric and the fit should constrain
them as such. Likewise, if two profile lines
intersect with an angle of 90 degrees ± ε2, then
a perpendicular constraint is likely, and should be
asserted. The algorithm cross checks all features
and sub-features versus all other features and
sub-features looking at all appropriate constraints.
Tangency constraints are initially asserted for all profile
contours.
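
A minimal sketch of hypothesis by parametric equivalence, covering only the two constraint types used as examples above. The data layout (holes as center and radius, lines as direction angles in degrees), the function name, and the default tolerances are all assumptions made for illustration.

```python
import math
from itertools import combinations


def hypothesize(holes, lines, eps1=0.05, eps2=2.0):
    """Propose concentric and perpendicular constraints.

    holes: name -> (cx, cy, r) in the anchor plane.
    lines: name -> direction angle in degrees.
    """
    found = []
    for (a, ha), (b, hb) in combinations(sorted(holes.items()), 2):
        if math.hypot(ha[0] - hb[0], ha[1] - hb[1]) <= eps1:
            found.append(("concentric", a, b))
    for (a, ta), (b, tb) in combinations(sorted(lines.items()), 2):
        diff = abs(ta - tb) % 180.0
        angle = min(diff, 180.0 - diff)   # undirected lines: 0 to 90 degrees
        if abs(angle - 90.0) <= eps2:
            found.append(("perpendicular", a, b))
    return found


holes = {"h1": (10.00, 5.00, 3.0), "h2": (10.02, 4.99, 1.5),
         "h3": (30.00, 5.00, 3.0)}
lines = {"l1": 0.5, "l2": 91.2, "l3": 45.0}
proposed = hypothesize(holes, lines)
```

Every pair is compared against every other pair, mirroring the cross check described above; the proposals would then be passed to the user for verification.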

Once the system has asserted all likely constraints,
the user is requested to verify them and is further
allowed to add new constraints. This is necessary
because the hypothesis process can fail in two instances.
First, if the original fit is off, then parametric
equivalence is unlikely to be satisfied. This
can be the case when the part was poorly manufactured
or fixtured, or when the sensing process cannot
acquire valid data. Second, complex constraints
such as symmetry are not directly inferred from the
model and require manual intervention. In our test
cases, the hypothesis process has been able to identify
most valid constraints, with less than a 5% false
positive ratio.

3.1.2  DOF  Reduction  and  Penalty 
Functions 

Once the constraints on the part have been specified,
the process generates a new model for the part
based on these constraints and the initial feature-based
model. First, the parameter set is reduced
using degree of freedom reduction. Second, any
constraints which cannot be represented by a simple
parameter reduction are implemented by building
penalty functions to represent them. The resulting
model includes a reduced parameterization and
a set of penalty functions associated with the constraints.
This new model is used to drive the optimization
process.

DOF reduction is the symbolic substitution of parametric
variables representing the same values. For
example, a hole is represented by its center and radius.
If the hole is concentric to another hole, then a
symbolic substitution is made so that any reference
to the second hole’s center is replaced by the first
hole’s center. Likewise, two parallel lines, using polar
coordinates, can use the same symbolic variable
for their theta parameter. Unfortunately, symbolic
substitution does not work when the parameterizations
of two features are not compatible or
when this would over-constrain the model. For example,
consider a single arc concentric with a triple
arc. The center of a single arc sub-feature is computed
using its tangent lines and the radius of the
arc. The center of a triple arc feature is computed
based on its tangent lines and three non-intuitive
parameters. Both parameterizations imply a center
but do not explicitly contain that information. Thus
to symbolically try to equate the two features’ centers
would be meaningless. Therefore, another way
must be used to represent these constraints.
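
The substitution step can be sketched as an index map from feature parameters to shared free variables. This is an illustrative reading of symbolic substitution, not the representation used by the system; `reduce_dofs` and the dictionary layout are hypothetical.

```python
def reduce_dofs(features, equalities):
    """Map each (feature, parameter) pair to a shared free-variable index.

    features: name -> list of parameter names.
    equalities: pairs ((feat_a, param), (feat_b, param)) where the second
    entry is replaced by the first.
    Returns (index map, number of free variables).
    """
    alias = {}

    def find(key):
        while key in alias:
            key = alias[key]
        return key

    for a, b in equalities:
        ra, rb = find(a), find(b)
        if ra != rb:
            alias[rb] = ra          # references to b now resolve to a

    index, slots = {}, {}
    for name, params in features.items():
        for p in params:
            root = find((name, p))
            if root not in slots:
                slots[root] = len(slots)
            index[(name, p)] = slots[root]
    return index, len(slots)


features = {"hole1": ["cx", "cy", "r"], "hole2": ["cx", "cy", "r"]}
concentric = [(("hole1", "cx"), ("hole2", "cx")),
              (("hole1", "cy"), ("hole2", "cy"))]
index, n_free = reduce_dofs(features, concentric)
```

With the concentric constraint applied, the two holes share their center variables and the optimizer sees four free variables instead of six.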

Such constraints can be expressed using penalty
functions. Penalty functions are a class of functions
that evaluate to zero when a condition is met, but
rapidly increase in value as the condition decreases
in validity. Thus, the penalty function, $\phi$,
associated with two concentric holes might be:

$$\phi = \lambda \, \delta(p_1, p_2)$$

where $p_1$ and $p_2$ are the centers of hole one and
hole two, respectively, $\delta$ computes the
Euclidean distance between two points, and $\lambda$
is a non-negative constant. By setting $\lambda$ to
zero, the constraint is ignored; by setting $\lambda$
to an arbitrarily large value, the optimization
process strictly enforces the constraint.
Optimization theory details the use of penalty
functions and the use of multiple $\lambda$ values
[Gottfried and Wiesman, 1973]. By using an
exterior-point algorithm we are able to start our
search outside the feasible area and then move into
it, guided by the data fit and constraint penalties.
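
A minimal Python sketch of such a penalty, assuming it is simply the multiplier times the Euclidean distance between the two hole centers. The function name `concentric_penalty` and the sample coordinates are illustrative only.

```python
import math

def concentric_penalty(p1, p2, lam):
    """Zero when the centers coincide; grows with their separation."""
    return lam * math.dist(p1, p2)

center1 = (10.0, 5.0)
center2 = (10.3, 5.4)  # roughly 0.5 away from center1

# lam = 0 ignores the constraint; a large lam makes any separation costly.
for lam in (0.0, 1.0, 1000.0):
    print(lam, concentric_penalty(center1, center2, lam))
```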

3.1.3  Optimized  Fitting 

A generic optimization problem involves following
a multi-dimensional function downhill until a
global (or local) minimum is found. This is
accomplished by iterating over a list of parametric
values, perturbing them until no new change leads to
a lower value. Once a minimum is found, the search
routine is generally reinitialized with the computed
values and restarted in an attempt to escape local
minima. Refer to [Press et al, 1991] for a discussion
of the Amoeba and Powell methods, which we
currently use.

The error function to be optimized is based on
the distance from the sensed data to the generated
model plus the penalties associated with each
constraint penalty function.

$$\text{model error} = \sum_{i=0}^{I} S_i \sum_{j=0}^{J} \delta(pt_j, F_i)$$

$$\text{penalty error} = \sum_{i=0}^{P} \lambda \, k_i \, \Phi_i$$

where $I$ is the number of features, $S$ is the
precedence associated with the current feature, $J$
is the number of points for feature $i$, $\delta$ is
the distance from the data point to the feature, and
$F$ is the set of features. $\lambda$ is the penalty
function multiplier, $P$ is the number of penalty
functions, $k$ is a scaling constant, and $\Phi$ is
the set of penalty functions. The sum of the model
error plus the penalty error is used to drive the
optimization process.

During each iteration of the search, the value
$\lambda$ is gradually increased in order to enforce
the constraints without adversely restricting the
search. The scaling factors $k_i$ transform the
various penalty measures into one meaningful level of
magnitude. The precedence value $S$ gives the
engineer a way to weight the data pertaining to each
feature, and thus transfer accuracy from features fit
with presumed accurate data to features where data is
unreliable.

The  penalty  function  multipliers  must  be  chosen  so 
as  to  enforce  the  constraints  without  removing  the 
impact  of  the  data  on  the  fit.  Thus,  as  the  constraints 
are  more  rigidly  enforced,  the  associated  error  to 
the  data  will  increase  while  the  error  to  the  intended 
model  will  decrease. 
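
The interplay between data fit and constraint penalty can be sketched in Python. This is an illustration under stated assumptions, not the system's implementation. It uses two circular holes whose sensed centers disagree by 0.2, squared point-to-circle distances for the model error, a single concentricity penalty, and a simple one-parameter-at-a-time perturbation search in place of the Amoeba and Powell methods. All names and values are hypothetical.

```python
import math

def circle_points(cx, cy, r, n=24):
    """n evenly spaced points on a circle, standing in for sensed data."""
    return [(cx + r * math.cos(2 * math.pi * k / n),
             cy + r * math.sin(2 * math.pi * k / n)) for k in range(n)]

# Parameters are [cx1, cy1, r1, cx2, cy2, r2].
DATA = [circle_points(0.0, 0.0, 1.0), circle_points(0.2, 0.0, 2.0)]
PRECEDENCE = [1.0, 1.0]  # S_i: weight given to each feature's data

def model_error(params):
    total = 0.0
    for i, points in enumerate(DATA):
        cx, cy, r = params[3 * i:3 * i + 3]
        total += PRECEDENCE[i] * sum(
            (math.hypot(x - cx, y - cy) - r) ** 2 for x, y in points)
    return total

def center_gap(params):
    """Penalty measure: distance between the two hole centers."""
    return math.hypot(params[0] - params[3], params[1] - params[4])

def total_error(params, lam, k=1.0):
    return model_error(params) + lam * k * center_gap(params)

def perturb_search(f, x, step=0.5, tol=1e-6):
    """Perturb one parameter at a time; halve the step when stuck."""
    x = list(x)
    best = f(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:i] + [x[i] + delta] + x[i + 1:]
                value = f(trial)
                if value < best:
                    x, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return x

params = [0.1, 0.1, 0.8, 0.1, 0.1, 1.8]  # rough initial guess
gaps = []
for lam in (0.0, 0.5, 1.0, 2.0, 4.0, 8.0):  # gradually tighten the constraint
    params = perturb_search(lambda p: total_error(p, lam), params)
    gaps.append(center_gap(params))
    print(f"lambda = {lam:4.1f}   center gap = {gaps[-1]:.4f}")
```

With the multiplier at zero the fit follows the data and the centers stay about 0.2 apart. As the multiplier grows the centers are pulled together, at the cost of a larger data error.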

4  Results 

We have applied our reverse engineering techniques
to several 2½-D parts from the Utah benchmark
suite [Thompson and Owen, 1994]. The benchmark
suite provides high quality milled parts and their
CAD models, allowing us to produce quantitative
analysis of the reverse engineering process. Each
part was painted in order to avoid specularity
problems, and then scanned from multiple views using
a non-contact laser digitizer. Refer to [Thompson
et al, 1995] for more detail on the sensing process.
The resultant unordered 3-D point clouds were
registered into a single coordinate system based on
common planar surfaces [Shum et al, 1994].

12 6 6 48 21 9 36 9 40 3 3 1 63 3 1 71 12 12 39 6 21
60 86 30 25 13 51 11 38 3 3 55 1 44
planar surfaces
6 6 32 4 32 34 22 47 16 36
* Identical but offset to the top pocket feature.

Table 1: Difference between recovered model and
original model (in microns).

The combined data cloud for each part was reverse
engineered using the REFAB system. This output
was used to instantiate the constraint process.
Both methods produce CAD models compatible
with the University of Utah's Alpha_1 CAD/CAM
system [Drake and Sela, 1989]. The results shown
in Table 1 represent the error associated with each
part and the degree-of-freedom reduction
accomplished. Precision was seen to improve by a
factor of 1.2 to 2.0, depending on the number of
constraints asserted on each part. The error measure
represents the RMS value of the distance from a
uniformly sampled distribution of points on the
reverse engineered model to the original Alpha_1
model. This value was calculated using standard
CAGD methods.

5 Conclusion

Most  model  reconstruction  techniques  apply 
generic  primitives  to  a  wide  range  of  problems.  We 
use  domain-specific  primitives  and  information  on 
a  narrow  set  of  objects,  in  order  to  reconstruct  high 
precision  models.  Three  levels  of  abstraction  are 
used. First, we use features to describe the geometry
of the object with minimal degrees-of-freedom.
Second,  we  apply  local  constraints  to  these  features, 
further  reducing  the  DOFs.  Finally,  we  apply  global 
constraints  to  features  and  sub-features,  once  again 
reducing  the  DOFs  associated  with  the  object  while 
also  attempting  to  capture  the  original  design  intent. 

Domain-specific reconstruction techniques require
detailed information about the area of interest, but
provide substantial benefits in the accuracy and
usefulness of the recovered model. We have
quantitatively shown their utility in the realm of
2½-D milled parts and have described in detail the
underlying techniques used by our system to segment
data, fit feature-based models, and apply
constraints to these models.

Abstract

A  novel  scene  reconstruction  technique  is  pre¬ 
sented,  different  from  previous  approaches  in 
its  ability  to  cope  with  large  changes  in  visi¬ 
bility  and  its  modeling  of  intrinsic  scene  color 
and  texture  information.  The  method  avoids 
image  correspondence  problems  by  working  in 
a  discretized  scene  space  whose  voxels  are  tra¬ 
versed  in  a  fixed  visibility  ordering.  This  strat¬ 
egy  takes  full  account  of  occlusions  and  allows 
the  input  cameras  to  be  far  apart  and  widely 
distributed  about  the  environment.  The  algo¬ 
rithm  identifies  a  special  set  of  invariant  voxels 
which  together  form  a  spatial  and  photometric 
reconstruction  of  the  scene,  fully  consistent  with 
the  input  images. 

1  Introduction 

We  consider  the  problem  of  acquiring  photo¬ 
realistic  3D  models  of  real  environments  from 
widely  distributed  viewpoints.  This  problem 
has sparked recent interest in the computer vision
community [Kanade et al, 1995, Moezzi et al, 1996,
Beardsley et al, 1996, Leymarie et al, 1996] as a
result of new applications in
telepresence,  virtual  walkthroughs,  automatic 
3D  model  construction,  and  other  problems  that 
require  realistic  textured  object  models. 

We use the term photorealism to refer to 3D
reconstructions of real scenes whose reprojections
contain sufficient color and texture information
to accurately reproduce images of the
scene from a wide range of target viewpoints.
To  ensure  accurate  reprojections,  the  input  im¬ 
ages  should  be  representative,  i.e.,  distributed 
throughout  the  target  range  of  viewpoints.  Ac¬ 
cordingly,  we  propose  two  criteria  that  a  photo¬ 
realistic  reconstruction  technique  should  satisfy: 

•  Photo  Integrity:  The  reprojected  model 
should  accurately  reproduce  the  color  and 
texture  of  the  input  images 

•  Broad  Coverage:  The  input  images  should 
be  widely  distributed  throughout  the  envi¬ 
ronment,  enabling  a  wide  coverage  of  scene 
surfaces 

Instead  of  using  existing  stereo  and  structure- 
from-motion  methods  to  solve  this  problem, 
we  choose  to  approach  it  from  first  principles. 
We  are  motivated  by  the  fact  that  current  re¬ 
construction  techniques  were  not  designed  with 
these  objectives  in  mind  and,  as  we  will  argue, 
do  not  fully  meet  these  requirements.  Driven 
by  the  belief  that  photo  integrity  has  more  to 
do  with  color  than  shape,  we  formulate  a  color 
reconstruction  problem,  in  which  the  goal  is  an 
assignment  of  colors  (radiances)  to  points  in  an 
(unknown)  approximately  Lambertian  scene.  It 
is  shown  that  certain  points  have  an  invariant 
coloring,  constant  across  all  possible  interpre¬ 
tations  of  the  scene,  consistent  with  the  input 
images. This leads to a volumetric voxel coloring
method that labels the invariant scene
voxels based on their projected correlation with
the  input  images.  By  traversing  the  voxels  in 
a  special  order  it  is  possible  to  fully  account 
for  occlusions — a  major  advantage  of  this  scene- 
based  approach.  The  result  is  a  complete  3D 
scene  reconstruction,  built  to  maximize  photo 
integrity. 

The  photorealistic  scene  reconstruction  prob¬ 
lem,  as  presently  formulated,  raises  a  number 
of  unique  challenges  that  push  the  limits  of 
existing  techniques.  First,  the  reconstruction 
must  be  dense  and  sufficiently  accurate  to  re¬ 
produce  the  original  images.  This  requirement 
poses  a  problem  for  feature-based  reconstruc¬ 
tion  methods,  which  produce  relatively  sparse 
reconstructions.  Although  sparse  reconstruc¬ 
tions  can  be  augmented  by  fitting  surfaces  (e.g., 
[Beardsley  et  al,  1996]),  the  triangulation  tech¬ 
niques  currently  used  cannot  easily  cope  with 
discontinuities  and,  more  importantly,  are  not 
image  driven.  Consequently,  surfaces  derived 
from  sparse  reconstructions  may  only  agree  with 
the  input  images  at  points  where  image  features 
were  detected. 

Contour-based  methods  (e.g.,  [Cipolla  and 
Blake,  1992,  Szeliski,  1993,  Seales  and  Faugeras, 
1995])  are  attractive  in  their  ability  to  cope 
with  changes  in  visibility,  but  do  not  produce 
sufficiently  accurate  depth-maps  due  to  prob¬ 
lems  with  concavities  and  lack  of  parallax  in¬ 
formation.  A  purely  contour-based  reconstruc¬ 
tion  can  be  texture-mapped,  as  in  [Moezzi  et 
al,  1996],  but  not  in  a  way  that  ensures  pro¬ 
jected  consistency  with  all  of  the  input  images, 
due to the aforementioned problems. In addition,
contour-based methods require occluding
contours to be isolated, a difficult segmentation
problem avoided by voxel coloring.

The  second  objective  requires  that  the  input 
views  be  scattered  over  a  wide  area  and  there¬ 
fore  exhibit  large  scale  changes  in  visibility  (i.e., 
occlusions,  changing  field  of  view).  While  some 
stereo  methods  can  cope  with  limited  occlu¬ 
sions,  visibility  changes  of  much  greater  mag¬ 
nitude  appear  to  be  beyond  the  state  of  the  art. 
In  addition,  the  views  may  be  far  apart,  making 
the  correspondence  problem  extremely  difficult. 
Existing stereo-based approaches to this problem
[Kanade et al, 1995] match nearby images
two  or  three  at  a  time  to  ameliorate  visibility 
problems.  This  approach,  however,  does  not 
fully  integrate  the  image  information  and  intro¬ 
duces  new  complications,  such  as  how  to  merge 
the  partial  reconstructions. 

The  voxel  coloring  algorithm  presented  here 
works  by  discretizing  scene  space  into  a  set  of 
voxels  that  are  traversed  and  colored  in  a  special 
order.  In  this  respect,  the  method  is  similar  to 
Collins’  Space-Sweep  approach  [Collins,  1996], 
which  performs  an  analogous  scene  traversal. 
However,  the  Space-Sweep  algorithm  does  not 
provide  a  solution  to  the  occlusion  problem,  a 
primary  contribution  of  this  paper.  Katayama 
et  al.  [Katayama  et  al,  1995]  described  a  re¬ 
lated  method  in  which  images  are  matched  by 
detecting  lines  through  slices  of  an  epipolar  vol¬ 
ume,  noting  that  occlusions  could  be  explained 
by  labeling  lines  in  order  of  increasing  slope. 
This  ordering  is  consistent  with  our  results,  fol¬ 
lowing  from  the  derivations  in  Section  2.  How¬ 
ever,  their  algorithm  used  a  reference  image, 
thereby  ignoring  points  that  are  occluded  in 
the  reference  image  but  visible  elsewhere.  Also, 
their  line  detection  strategy  requires  that  the 
views  all  lie  on  a  straight  line,  a  significant 
limitation.  An  image-space  visibility  ordering 
was  described  by  McMillan  and  Bishop  [McMil¬ 
lan  and  Bishop,  1995],  but  was  used  only  for 
image  rendering,  not  correspondence  computa¬ 
tion  or  scene  reconstruction.  Also  related  are 
recently  developed  panoramic  stereo  [McMillan 
and  Bishop,  1995,  Kang  and  Szeliski,  1996]  al¬ 
gorithms  that  avoid  field  of  view  problems  by 
matching  360  degree  panoramic  views  directly. 
Panoramic  reconstructions  can  also  be  achieved 
using  our  approach,  but  without  the  need  to 
first  build  panoramic  images  (see  Figures  1(b) 
and  4). 

The  remainder  of  the  paper  is  organized  as  fol¬ 
lows.  Section  2  formulates  and  solves  the  voxel 
coloring  problem,  and  describes  its  relationship 
to  shape  reconstruction.  Section  3  presents  an 
efficient  algorithm  for  computing  the  voxel  col¬ 
oring  from  a  set  of  images.  Section  4  describes 
some  experiments  on  real  and  synthetic  image 
sequences  that  demonstrate  how  the  method 
performs.

Figure 1: Two camera geometries that satisfy
the ordinal visibility constraint.

2  Voxel  Coloring 

This  section  describes  the  voxel  coloring  prob¬ 
lem  in  detail.  The  main  results  require  a  visibil¬ 
ity  property  that  constrains  the  camera  place¬ 
ment  relative  to  the  scene,  but  still  permits  the 
input  cameras  to  be  spread  widely  throughout 
the  scene.  The  visibility  property  defines  a  fixed 
occlusion  ordering,  enabling  scene  reconstruc¬ 
tion  with  a  single  pass  through  the  voxels  in 
the  scene. 

We assume that the scene is entirely composed
of rigid Lambertian surfaces under fixed
illumination. Under these conditions, the radiance at
each point is isotropic and can therefore be
described by a scalar value which we call color.
We also use the term color to refer to the
irradiance of an image pixel. The term's meaning
should be clear from context.

2.1  Notation 

A 3D scene $\mathcal{S}$ is represented as a finite¹
set of opaque voxels (volume elements), each of which
occupies a finite and homogeneous scene volume
and has a fixed color. We denote the set of all
voxels with the symbol $\mathcal{V}$. An image is
specified by the set $\mathcal{I}$ of all its pixels.
For now, assume that pixels are infinitesimally small.

¹It is assumed that the visible scene is spatially
bounded.

Given an image pixel $p$ and scene $\mathcal{S}$, we
refer to the voxel $V \in \mathcal{S}$ that is visible
and projects to $p$ by $V = \mathcal{S}(p)$. The color
of an image pixel $p \in \mathcal{I}$ is given by
$color(p, \mathcal{I})$ and of a voxel $V$ by
$color(V, \mathcal{S})$. A scene $\mathcal{S}$ is said
to be complete with respect to a set of images if, for
every image $\mathcal{I}$ and every pixel
$p \in \mathcal{I}$, there exists a voxel
$V \in \mathcal{S}$ such that $V = \mathcal{S}(p)$. A
complete scene is said to be consistent with a set of
images if, for every image $\mathcal{I}$ and every
pixel $p \in \mathcal{I}$,

$$color(p, \mathcal{I}) = color(\mathcal{S}(p), \mathcal{S}) \qquad (1)$$

2.2  Camera  Geometry 

A  pinhole  perspective  projection  model  is  as¬ 
sumed,  although  the  main  results  use  a  visibility 
assumption  that  applies  equally  to  other  cam¬ 
era  models  such  as  orthographic  and  aperture- 
based  models.  We  require  that  the  viewpoints 
(camera  positions)  are  distributed  so  that  ordi¬ 
nal  visibility  relations  between  scene  points  are 
preserved.  That  is,  if  scene  point  P  occludes  Q 
in  one  image,  Q  cannot  occlude  P  in  any  other 
image.  This  is  accomplished  by  ensuring  that 
all  viewpoints  are  “on  the  same  side”  of  the  ob¬ 
ject.  For  instance,  suppose  the  viewpoints  are 
distributed  on  a  single  plane,  as  shown  in  Fig¬ 
ure  1(a).  For  every  such  viewpoint,  the  relative 
visibility  of  any  two  points  depends  entirely  on 
which  point  is  closer  to  the  plane.  Because  the 
visibility  order  is  fixed  for  every  viewpoint,  we 
say  that  this  range  of  viewpoints  preserves  or¬ 
dinal  visibility. 

Planarity,  however,  is  not  required;  the  ordinal 
visibility  constraint  is  satisfied  for  a  relatively 
wide  range  of  viewpoints,  allowing  significant 
flexibility  in  the  image  acquisition  process.  Ob¬ 
serve  that  the  constraint  is  violated  only  when 
there  exist  two  scene  points  P  and  Q  such  that 
P  occludes  Q  in  one  view  while  Q  occludes  P 
in  another.  This  condition  implies  that  P  and 
Q  lie  on  the  line  segment  between  the  two  cam¬ 
era  centers.  Therefore,  a  sufficient  condition  for 
the  ordinal  visibility  constraint  to  be  satisfied 
is  that  no  scene  point  be  contained  within 
the  convex  hull  C  of  the  camera  centers. 
For convenience, C will be referred to as the
camera volume. We use the notation dist(V, C)
to denote the distance of a voxel V to the camera
volume. Figure 1 shows two useful camera
geometries that satisfy this constraint, one
a downward facing camera moved 360 degrees
around an object, and the other outward facing
cameras on a sphere.

Figure 2: (a-d) Four scenes that are indistinguishable from these two viewpoints. Shape
ambiguity: scenes (a) and (b) have no points in common; no hard points exist. Color
ambiguity: (c) and (d) share a point that has a different color assignment in the two
scenes. (e) The voxel coloring produced from the two images in (a-d). These six points
have the same color in every consistent scene that contains them.
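
The sufficient condition of Section 2.2, that no scene point lies inside the convex hull of the camera centers, can be checked directly. The sketch below is a two-dimensional simplification in Python, with cameras and scene points as integer (x, y) pairs. The function names are illustrative, and a full version would test against a 3D hull.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(hull, q):
    """True if q lies inside or on the boundary of the hull."""
    n = len(hull)
    if n < 3:  # cameras coincide or are collinear: a point or a segment
        xs = [p[0] for p in hull]
        ys = [p[1] for p in hull]
        on_line = n == 1 or cross(hull[0], hull[1], q) == 0
        return (on_line and min(xs) <= q[0] <= max(xs)
                and min(ys) <= q[1] <= max(ys))
    return all(cross(hull[i], hull[(i + 1) % n], q) >= 0 for i in range(n))

def hull_condition_holds(cameras, scene_points):
    """Sufficient condition: no scene point in the camera volume."""
    hull = convex_hull(cameras)
    return not any(inside_hull(hull, q) for q in scene_points)

scene = [(0, 0), (1, 1), (-1, 2)]
cameras_one_side = [(-2, 5), (0, 5), (2, 5)]        # all above the scene
cameras_ring = [(-3, -3), (3, -3), (3, 3), (-3, 3)]  # surrounding the scene

print(hull_condition_holds(cameras_one_side, scene))  # True
print(hull_condition_holds(cameras_ring, scene))      # False
```

A `False` result only means the sufficient condition fails. It does not by itself prove that two scene points actually swap occlusion order.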


2.3  Color  Invariance 

It  is  well  known  that  a  set  of  images  can  be  con¬ 
sistent  with  more  than  one  rigid  scene.  Deter¬ 
mining  a  scene’s  spatial  occupancy  is  therefore 
an  ill-posed  task  because  a  voxel  contained  in 
one  consistent  scene  may  not  be  contained  in 
another  (Figure  2(a,b)).  Alternatively,  a  voxel 
may  be  part  of  two  consistent  scenes,  but  have 
different  colors  in  each  (Figure  2(c,d)). 

Given  a  multiplicity  of  solutions  to  the  re¬ 
construction  problem,  the  only  way  to  re¬ 
cover  intrinsic  scene  information  is  through 
invariants —  properties  that  are  satisfied  by  ev¬ 
ery  consistent  scene.  For  instance,  consider  the 
set  of  voxels  that  are  present  in  every  consistent 
scene.  Laurentini  [Laurentini,  1995]  described 
how  these  invariants,  called  hard  points,  could 
be  recovered  by  volume  intersection  from  binary 
images.  Hard  points  are  useful  in  that  they  pro¬ 
vide  absolute  information  about  the  true  scene. 
However,  such  points  can  be  difficult  to  come 
by; some images may yield none (e.g., Figure 2).
In  this  section  we  describe  a  more  frequently  oc¬ 
curring  type  of  invariant  relating  to  color  rather 
than  shape. 

A voxel $V$ is a color invariant with
respect to a set of images if, for every
pair of scenes $\mathcal{S}$ and $\mathcal{S}'$
consistent with the images,
$V \in \mathcal{S}, \mathcal{S}'$ implies
$color(V, \mathcal{S}) = color(V, \mathcal{S}')$

Unlike shape invariance, color invariance does
not require that a point be present in every
consistent scene. As a result, color invariants tend
to be more common than hard points. In particular,
any set of images satisfying the ordinal
visibility constraint yields enough color
invariants to form a complete scene reconstruction, as
will be shown.

Let $\mathcal{I}_1, \ldots, \mathcal{I}_m$ be a set of
images. For a given image point $p \in \mathcal{I}_j$
define $V_p$ to be the voxel in
$\{\mathcal{S}(p) \mid \mathcal{S} \text{ consistent}\}$
that is closest to the camera volume. We claim that
$V_p$ is a color invariant. To establish this, observe
that $V_p \in \mathcal{S}$ implies
$V_p = \mathcal{S}(p)$, for if
$V_p \neq \mathcal{S}(p)$, $\mathcal{S}(p)$ must be
closer to the camera volume, which is impossible by
the construction of $V_p$. It then follows from
Eq. (1) that $V_p$ has the same color in every
consistent scene; $V_p$ is a color invariant.

The voxel coloring of an image set
$\mathcal{I}_1, \ldots, \mathcal{I}_m$ is defined to be:

$$\overline{\mathcal{S}} = \{V_p \mid p \in \mathcal{I}_i,\ 1 \le i \le m\}$$

Figure  2(e)  shows  the  voxel  coloring  resulting 
from  a  pair  of  views.  These  six  points  have  a 
unique  color  interpretation,  constant  in  every 
consistent  scene.  They  also  comprise  the  closest 
consistent  scene  to  the  cameras  in  the  following 
sense — every  point  in  each  consistent  scene  is 
either  included  in  the  voxel  coloring  or  is  fully 
occluded  by  points  in  the  voxel  coloring.  An 
interesting  consequence  of  this  closeness  bias  is 
that  neighboring  image  pixels  of  the  same  color 
produce cusps in the voxel coloring, i.e., protrusions
toward the camera volume. This phenomenon
is clearly shown in Figure 2(e) where
the white and black points form two separate
cusps.  Also,  observe  that  the  voxel  coloring  is 
not  a  minimal  reconstruction;  removing  the  two 
closest  points  in  Figure  2(e)  still  leaves  a  con¬ 
sistent  scene. 

2.4  Computing  the  Voxel  Coloring 

In this section we describe how to compute the
voxel coloring from a set of images. In addition,
it will be shown that the set of voxels contained
in a voxel coloring forms a scene reconstruction
that is consistent with the input images.

The  voxel  coloring  is  computed  one  voxel  at 
a  time  in  an  order  that  ensures  agreement 
with  the  images  at  each  step,  guaranteeing  that 
all  reconstructed  voxels  satisfy  Eq.  (1).  To 
demonstrate  that  voxel  colorings  form  consis¬ 
tent  scenes,  we  also  have  to  show  that  they 
are  complete,  i.e.,  they  account  for  every  image 
pixel  as  defined  in  Section  2.1. 

In order to make sure that the construction is
incrementally consistent, i.e., agrees with the
images at each step, we need to introduce a weaker
form of consistency that applies to incomplete
voxel sets. Accordingly, we say that a set of
points with color assignments is voxel-consistent
if its projection agrees fully with the subset of
every input image that it overlaps. More formally,
a set $\mathcal{S}$ is said to be voxel-consistent with
images $\mathcal{I}_1, \ldots, \mathcal{I}_m$ if for
every voxel $V \in \mathcal{S}$ and image pixels
$p \in \mathcal{I}_i$ and $q \in \mathcal{I}_j$,
$V = \mathcal{S}(p) = \mathcal{S}(q)$ implies
$color(p, \mathcal{I}_i) = color(q, \mathcal{I}_j)$.
For notational convenience, define $\mathcal{S}_V$ to
be the set of all voxels in $\mathcal{S}$ that are
closer than $V$ to the camera volume. Scene
consistency and voxel consistency are related by the
following properties:

1. If $\mathcal{S}$ is a consistent scene then
$\{V\} \cup \mathcal{S}_V$ is a voxel-consistent set
for every $V \in \mathcal{S}$.

2. Suppose $\mathcal{S}$ is complete and, for each
point $V \in \mathcal{S}$, $\{V\} \cup \mathcal{S}_V$
is voxel-consistent. Then $\mathcal{S}$ is a
consistent scene.

A consistent scene may be created using the second
property by incrementally moving further
from the camera volume and adding voxels to
the current set that maintain voxel-consistency.
To formalize this idea, we define the following
partition of 3D space into voxel layers of uniform
distance from the camera volume:

$$\mathcal{V}_d = \{V \mid dist(V, C) = d\} \qquad (2)$$

$$\mathcal{V} = \bigcup_{i=1}^{r} \mathcal{V}_{d_i} \qquad (3)$$

where $d_1, \ldots, d_r$ is an increasing sequence of
numbers.

The voxel coloring is computed inductively as
follows:

$$SP_1 = \{V \mid V \in \mathcal{V}_{d_1},\ \{V\} \text{ voxel-consistent}\}$$

$$SP_k = \{V \mid V \in \mathcal{V}_{d_k},\ \{V\} \cup SP_{k-1} \text{ voxel-consistent}\}$$

$$SP = \{V \mid V = SP_r(p) \text{ for some pixel } p\}$$

We claim $SP = \overline{\mathcal{S}}$. To prove this,
first define
$\overline{\mathcal{S}}_i = \{V \mid V \in \overline{\mathcal{S}},\ dist(V, C) \le d_i\}$.
$\overline{\mathcal{S}}_1 \subseteq SP_1$ by the first
consistency property. Inductively, assume that
$\overline{\mathcal{S}}_{k-1} \subseteq SP_{k-1}$ and
let $V \in \overline{\mathcal{S}}_k$. By the first
consistency property,
$\{V\} \cup \overline{\mathcal{S}}_{k-1}$ is
voxel-consistent, implying that $\{V\} \cup SP_{k-1}$
is also voxel-consistent, because the second set
includes the first and $SP_{k-1}$ is itself
voxel-consistent. It follows that
$\overline{\mathcal{S}} \subseteq SP_r$. Note also
that $SP_r$ is complete, since one of its subsets
is complete, and hence consistent by the second
consistency property. $SP$ contains all the voxels
in $SP_r$ that are visible in any image, and
is therefore consistent as well. Therefore $SP$
is a consistent scene such that for each pixel $p$,
$SP(p)$ is at least as close to $C$ as
$\overline{\mathcal{S}}(p)$. Hence
$SP = \overline{\mathcal{S}}$. □

In  summary,  the  following  properties  of  voxel 
colorings  have  been  shown: 

•  $\overline{\mathcal{S}}$ is a consistent scene

•  Every voxel in $\overline{\mathcal{S}}$ is a color invariant

•  $\overline{\mathcal{S}}$ is directly computable from any set of
images satisfying the ordinal visibility constraint

3  Reconstruction  by  Voxel  Coloring 

In this section we present a voxel coloring
algorithm for reconstructing a scene from a set
of calibrated images. The algorithm closely follows
the voxel coloring construction outlined in
Section 2, adapted to account for image
quantization and noise. As before, it is assumed that
3D space has been partitioned into a series of
voxel layers
$\mathcal{V}_{d_1}, \ldots, \mathcal{V}_{d_r}$
increasing in distance from the camera volume. The
images $\mathcal{I}_1, \ldots, \mathcal{I}_m$
are assumed to be quantized into finite
non-overlapping pixels. The cameras are assumed
to satisfy the ordinal visibility constraint, i.e.,
no scene point lies within the camera volume.
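
The layer partition can be sketched in a few lines of Python. This is an illustration under an assumed setup: cameras lie in the plane z = 0 with the scene below them, so the distance of a voxel to the camera volume is taken to be its z coordinate. The names `voxel_layers` and `dist_to_camera_volume` are hypothetical.

```python
from collections import defaultdict

def voxel_layers(voxels, dist_to_camera_volume):
    """Group voxels by distance to the camera volume, nearest layer first."""
    layers = defaultdict(list)
    for voxel in voxels:
        layers[dist_to_camera_volume(voxel)].append(voxel)
    return [layers[d] for d in sorted(layers)]

# A 2 x 2 x 3 block of voxels, listed in no particular depth order.
voxels = [(x, y, z) for z in (3, 1, 2) for x in range(2) for y in range(2)]
layers = voxel_layers(voxels, dist_to_camera_volume=lambda v: v[2])

for i, layer in enumerate(layers, start=1):
    print(f"layer {i}: {len(layer)} voxels at distance {layer[0][2]}")
```

Integer distances make exact grouping safe here. With real-valued distances the layers would have to be binned to the voxel resolution.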

If a voxel $V$ is not fully occluded in image
$\mathcal{I}_j$, its projection will overlap a
nonempty set of image pixels, $\pi_j$. Without noise
or quantization effects, a consistent voxel should
project to a set of pixels with equal color values.
In the presence of these effects, we evaluate the
correlation of the pixel colors to measure the
likelihood of voxel consistency. Let $s$ be the
standard deviation and $n$ the cardinality of
$\bigcup_{j=1}^{m} \pi_j$. Suppose the sensor error
(accuracy of irradiance measurement) is approximately
normally distributed with standard deviation
$\sigma_0$. If $\sigma_0$ is unknown, it can be
estimated by imaging a homogeneous surface and
computing the standard deviation of image pixels.
The consistency of a voxel can be estimated using the
following likelihood ratio test, distributed as
$\chi^2$:

$$\lambda_V = \frac{(n-1)s^2}{\sigma_0^2}$$
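
A minimal Python sketch of this statistic, assuming scalar pixel colors and a known sensor standard deviation. The function name, the sample values, and the cutoff of 7.81 (the 95% point of a chi-square distribution with 3 degrees of freedom) are illustrative choices, not values from the experiments.

```python
import statistics

def consistency_statistic(pixel_colors, sigma0):
    """lambda_V = (n - 1) * s^2 / sigma0^2 over a voxel's projected pixels."""
    n = len(pixel_colors)
    if n < 2:
        return 0.0  # a single pixel cannot disagree with itself
    s_squared = statistics.variance(pixel_colors)  # sample variance
    return (n - 1) * s_squared / sigma0 ** 2

sigma0 = 2.0
thresh = 7.81  # illustrative cutoff for n - 1 = 3 degrees of freedom

agreeing = [100, 102, 99, 101]   # small spread, plausibly one surface point
conflicting = [100, 30, 220, 95]  # large spread, probably different surfaces

for colors in (agreeing, conflicting):
    lam_v = consistency_statistic(colors, sigma0)
    print(colors, round(lam_v, 2), "consistent" if lam_v < thresh else "rejected")
```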


3.1  Voxel  Coloring  Algorithm 

The  algorithm  is  as  follows: 


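$\mathcal{S} = \emptyset$
for $i = 1, \ldots, r$ do
    for every $V \in \mathcal{V}_{d_i}$ do
        project to $\mathcal{I}_1, \ldots, \mathcal{I}_m$, compute $\lambda_V$
        if $\lambda_V < thresh$ then $\mathcal{S} = \mathcal{S} \cup \{V\}$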

The threshold, thresh, corresponds to the maximum
allowable correlation error. An overly
conservative (small) value of thresh results in an
accurate but incomplete reconstruction. On the
other hand, a large threshold yields a more complete
reconstruction, but one that includes some
erroneous voxels. In practice, thresh should be
chosen according to the desired characteristics
of the reconstructed model, in terms of accuracy
vs. completeness.

The problem of detecting occlusions is greatly
simplified by the scene traversal ordering used
in the algorithm; the order is such that if $V$
occludes $V'$ then $V$ is visited before $V'$.
Therefore, occlusions can be detected by using a
one-bit Z-buffer for each image. The Z-buffer is
initialized to 0. When a voxel $V$ is processed,
$\pi_i$ is the set of pixels that overlap $V$'s
projection in $\mathcal{I}_i$ and have Z-buffer
values of 0. Once $\lambda_V$ is calculated, these
pixels are then marked with Z-buffer values of 1.
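
The traversal and the one-bit Z-buffers can be illustrated on a toy problem. The Python sketch below is a simplified two-dimensional stand-in, not the implementation used in the experiments. Voxels are grid points (x, z), each hypothetical camera is a parallel projection u = x + d*z, images are noise-free, and pixels are marked only when a voxel is accepted, which is an assumption about the intended behavior.

```python
import statistics

CAMERAS = (-1, 0, 1)  # camera d maps voxel (x, z) to pixel u = x + d*z
WIDTH, DEPTH = 8, 4   # pixels per image, number of voxel layers

def render(scene):
    """Images of `scene` (dict voxel -> color); the nearest voxel wins."""
    images = {}
    for d in CAMERAS:
        row = []
        for u in range(WIDTH):
            hits = [scene[(u - d * z, z)] for z in range(1, DEPTH + 1)
                    if (u - d * z, z) in scene]
            row.append(hits[0] if hits else None)
        images[d] = row
    return images

def consistency(colors, sigma0=1.0):
    n = len(colors)
    return 0.0 if n < 2 else (n - 1) * statistics.variance(colors) / sigma0 ** 2

def voxel_coloring(images, thresh=1e-9):
    marked = {d: [False] * WIDTH for d in CAMERAS}  # one-bit Z-buffers
    recon = {}
    for z in range(1, DEPTH + 1):                   # layers, nearest first
        for x in range(-DEPTH, WIDTH + DEPTH):
            pix = [(d, x + d * z) for d in CAMERAS]
            pix = [(d, u) for d, u in pix
                   if 0 <= u < WIDTH and not marked[d][u]]
            if not pix:
                continue                            # occluded or out of view
            colors = [images[d][u] for d, u in pix]
            if consistency(colors) < thresh:
                recon[(x, z)] = statistics.mean(colors)
                for d, u in pix:
                    marked[d][u] = True             # these pixels are explained
    return recon

# Ground truth: a textured back wall plus a small foreground object.
truth = {(x, DEPTH): 1 + x % 3 for x in range(-DEPTH, WIDTH + DEPTH)}
truth.update({(3, 2): 9, (4, 2): 9})

images = render(truth)
recon = voxel_coloring(images)
print(len(truth), "true voxels,", len(recon), "reconstructed voxels")
print("reprojection matches input:", render(recon) == images)
```

The reconstruction need not equal the true scene, since it is biased toward the cameras. In this noise-free setting its reprojection reproduces every input image.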

3.2  Discussion 

The algorithm visits each voxel exactly once
and projects it into every image. Therefore, the
time complexity of voxel coloring is
O(voxels * images). To determine the space
complexity, observe that evaluating one voxel does
not require access to or comparison with other
voxels. Consequently, voxels need not be stored
during the algorithm; the voxels making up the voxel
coloring will simply be output one at a time.
Only the images and one-bit Z-buffers need to
be stored. The fact that the complexity of voxel
coloring is linear in the number of images is
essential in that it enables large sets of images to
be processed at once.

The  algorithm  is  unusual  in  that  it  does  not 
perform  any  window-based  image  matching  in 
the  reconstruction  process.  Correspondences 
are  found  implicitly  during  the  course  of  scene 
traversal. A disadvantage of this searchless
strategy is that it requires very precise camera
calibration to achieve the triangulation accuracy
of existing stereo methods. Accuracy also
depends on the voxel resolution.

Importantly, the approach reconstructs only
one of the potentially numerous scenes consistent
with the input images. Consequently, it is
susceptible to aperture problems caused by image
regions of near-uniform color. These regions
will produce cusps in the reconstruction (see
Figure 2(e)) since voxel coloring seeks the
reconstruction closest to the camera volume. This
is a bias, just like smoothness is a bias in stereo
methods, but one that guarantees a consistent
reconstruction even with severe occlusions.

Figure 3: Reconstruction of a dinosaur toy. (a) One of 21 input images taken from slightly above
the toy while it was rotated 360°. (b-c) Two views rendered from the reconstruction.

4  Experimental  Results 

The  first  experiment  involved  reconstructing  a 
dinosaur  toy  from  21  views  spanning  a  360  de¬ 
gree  rotation  of  the  toy.  Figure  3  shows  the 
voxel  coloring  computed.  To  facilitate  recon¬ 
struction,  we  used  a  black  background  and  elim¬ 
inated  most  of  the  background  points  by  thresh¬ 
olding  the  images.  While  background  subtrac¬ 
tion  is  not  strictly  necessary,  leaving  this  step 
out  results  in  background-colored  voxels  scat¬ 
tered  around  the  edges  of  the  scene  volume.  The 
threshold  may  be  chosen  conservatively  since  re¬ 
moving  most  of  the  background  pixels  is  suf¬ 
ficient  to  eliminate  this  background  scattering 
effect.  Figure  3(b)  shows  the  reconstruction 
from  approximately  the  same  viewpoint  as  (a) 
to  demonstrate  the  photo  integrity  of  the  re¬ 
construction.  Figure  3(c)  shows  another  view 
of  the  reconstructed  model.  Note  that  fine  de¬ 
tails  such  as  the  wind-up  rod  and  hand  shape 
were  accurately  reconstructed.  The  reconstruc¬ 
tion  contained  32,244  voxels  and  took  45  sec¬ 
onds  to  compute. 

A  second  experiment  involved  reconstructing  a 
synthetic  room  from  views  inside  the  room.  The 
room interior was highly concave, making accurate
reconstruction by volume intersection or
other  contour-based  methods  impractical.  Fig¬ 
ure  4  compares  the  original  and  reconstructed 
models  from  new  viewpoints.  New  views  were 
generated  from  the  room  interior  quite  accu¬ 
rately,  as  shown  in  (a),  although  some  details 
were  lost.  For  instance,  the  reconstructed  walls 
were  not  perfectly  planar.  This  point  drift  effect 
is  most  noticeable  in  regions  where  the  texture  is 
locally  homogeneous,  indicating  that  texture  in¬ 
formation  is  important  for  accurate  reconstruc¬ 
tion.  The  reconstruction  contained  52,670  vox¬ 
els  and  took  95  seconds  to  compute. 

5  Concluding  Remarks 

This  paper  presented  a  new  scene  reconstruc¬ 
tion  technique  that  incorporates  intrinsic  color 
and  texture  information  for  the  acquisition  of 
photorealistic  scene  models.  Unlike  existing 
stereo  and  structure-from-motion  techniques, 
the  method  guarantees  that  a  consistent  recon¬ 
struction  is  found,  even  under  severe  visibility 
changes,  subject  to  a  weak  constraint  on  the 
camera  geometry.  A  second  contribution  was 
the  constructive  proof  of  the  existence  of  a  set 
of  color  invariants.  These  points  are  useful  in 
two  ways:  first,  they  provide  information  that 
is  intrinsic,  i.e.,  constant  across  all  possible  con¬ 
sistent  scenes.  Second,  together  they  constitute 
a  volumetric  spatial  reconstruction  of  the  scene 
whose  projections  exactly  match  the  input  im¬ 
ages. 

Figure 4: Reconstruction of a synthetic room scene. (a) The voxel coloring. (b) The original
model from a new viewpoint. (c) and (d) show the reconstruction and original model,
respectively, from a new viewpoint outside the room.


Abstract

We  propose  a  new  surface  fitting  scheme  called  “winged 
B-snakes”  for  reconstructing  smooth  surfaces  and  con¬ 
currently  preserving  discontinuity  edges.  The  input  data 
can  be  noisy,  sparse,  scattered  3D  data  on  a  surface  with 
unknown  crease  edges.  First,  we  use  a  vector  voting 
method  to  produce  three  potential  fields  for  surfaces, 
edges,  and  junctions.  The  potential  fields  are  stored  in 
three  volumetric  grids,  giving  each  voxel  the  probability 
of  being  a  surface  point,  an  edge  point,  and  a  junction 
point.  Then  we  drop  a  deformable  surface  in  the  potential 
fields  for  evolution.  The  surface  representation  is  the  lat¬ 
est  triangular  B-spline  model.  By  performing  energy 
minimization  in  the  potential  fields,  each  active  edge 
slides  to  align  with  discontinuity  edges,  and  its  wing 
patches flap to fit to the surface data. Finally, a smooth C¹
surface  preserving  discontinuity  edges  and  junctions  is 
constructed. 

1  The  Problem 

Geometric  models  play  an  important  role  in  graphics, 
CAD/CAM,  visualization,  animation,  multimedia,  com¬ 
puter  vision,  and  many  other  fields  [1].  Geometric  mod¬ 
els  can  be  designed,  stored,  and  processed  as  volumetric 
primitives or surface patches. Complex models can
be built more efficiently by reconstructing them
from already existing objects or clay models
than by designing them from scratch using
CAD software. Surface reconstruction is the
process of building surface models from sampled
data.

The sampled data have different forms: one or a
few intensity images, slices extracted from CT/MRI
data, dense regularly-gridded range images
scanned by a laser beam, sparse coordinates
measured by a probe, etc. To be more general, we
use noisy sparse scattered 3D data [2]. Nevertheless,
our proposed new scheme is also applicable
to dense range images, but some
intermediate stages can be made more efficient
if simpler methods are used for dense data on a
regular grid.

* This research was supported in part by the Advanced
Research Projects Agency of the Department of Defense and was
monitored by Topographic Engineering Center of the U.S. Army.

In  recent  years,  deformable  surfaces  based  on  regular¬ 
ization  theory  seem  to  have  provided  a  systematic 
approach  to  handling  the  uncertainty  and  ill-posedness 
in  the  noisy  sparse  scattered  data,  and  thus  have  become 
the  focus  of  surface  reconstruction  research  [15].  Even 
in  CAD/CAGD,  the  deformable-surface-based  varia¬ 
tional  design  has  become  a  popular  scheme  [4]-[12]. 
The idea behind deformable surfaces is that, besides the
external fitting energy pulling the surface toward the data
points, a smoothness constraint is introduced as an internal
energy to regularize the surface itself, so that the
reconstructed surface will not go wild in very noisy
regions and can still be defined over gaps where there
are no measured data.

However,  when  the  data  contains  multiple  objects,  or  an 
object  has  discontinuity  edges  on  its  surface,  such  over¬ 
all  regularization  may  fail,  hence  an  oversmoothed  sur¬ 
face  is  obtained,  and  the  discontinuity  features  may  be 
lost.  Also,  there  exist  overshoots/undershoots  and  ring¬ 
ings  (Gibbs  effects)  around  the  discontinuities.  So  we 
wish  to  be  given  the  discontinuity  edges  before  we  do 
surface  reconstruction.  However,  automatic  edge  detec¬ 
tion,  linking,  localization  and  classification  are  hard 
problems,  and  to  find  unknown  edges,  we  would  require 
a  faithful  surface  representation  for  the  raw  data.  Due  to 
this  circular  dependence  between  surface  reconstruction 
and  edge  detection,  researchers  have  realized  that  they 
are  two  coupled  processes  and  cannot  be  well  solved 
separately.  Based  on  this  observation,  we  propose  a 
novel approach to surface reconstruction coupled with
edge detection. We do not assume known discontinuities
on the surface. Our approach performs surface
reconstruction and discontinuity detection concurrently.

Guy and Medioni [3] have implemented an automatic
algorithm  for  discontinuity  detection  and  surface  fitting, 
which  produces  dense  triangular  meshes  and  discontinu¬ 
ity  curves  automatically  from  scattered  data.  In  this 
paper,  we  use  their  approach  to  infer  potential  informa¬ 
tion  for  surfaces  and  discontinuities,  but  the  surface  rep¬ 
resentation  is  not  the  planar  triangular  meshes.  Instead, 
we  use  smooth  triangular  B-spline  surfaces. 

Triangular B-splines [4]-[8] overcome the shortcomings
of tensor-product (TP) B-splines but retain the good
properties of automatic continuity, convex hull, etc.
TP-splines have a rectangular topology and thus require
tedious trimming techniques to handle pole artifacts and
irregular boundaries. Also, to obtain C¹ continuity, the
TP B-splines should be quadratic in terms of both
parameters, so the total degree is four (quartic), whereas
for triangular splines a total degree as low as two
(quadratic) can maintain C¹ continuity. Duplicate
knots in TP splines will produce a discontinuity curve
across the whole surface, while triangular splines can
produce a local discontinuity edge. Similarly, triangular
splines allow for local subdivision for refinement.

Our  approach  starts  with  a  grouping  stage  to  infer  dense 
potential  information  from  the  sparse  data.  In  the  second 
stage,  a  deformable  triangular  surface  coupled  with 
active  edges  is  dropped  into  the  potential  fields.  We  call 
the  model  “winged  B-snakes”.  After  adjusting  the  con¬ 
trol  points  by  a  few  iterations  of  energy  minimization, 
the  surface  (wings)  flap  to  fit  the  data,  and  the  edges 
(snakes)  slide  to  align  with  the  actual  edges  in  the  data. 
Then  in  the  third  stage,  values  and  derivatives  along 
each  edge  are  checked,  so  that  discontinuities  can  be 
detected  and  preserved  in  constructing  the  surface.  The 
surface  can  also  be  extended  to  rational  forms  and  the 
weights  are  adjusted  for  surface  fine-tuning  and  fairing. 
An introduction to triangular B-splines is given in
Section 2. In Section 3, we explain the principles of the
winged B-snakes, with some experimental results given
in Section 4. Further topics are discussed in Section 5
and we conclude the paper in Section 6.


2 The Representation

Triangular B-splines are defined over 2D domain triangulations,
and C^(k-1) continuity can be achieved by k-th
degree polynomials defined over the domain triangles.
In our work, we consider C¹ surfaces using quadratic
polynomials (k=2). Given an arbitrary triangulation of
the 2D domain, two additional points are added near
each vertex of the triangle [v_i, v_j, v_k] to provide nine
knots for each triangle (two adjacent triangles share six
knots). Then from the nine knots, five knots (including
the original vertices v_i, v_j, v_k, plus two additional knots)
are chosen to form a knot set. Six different knot sets are
chosen; and over each set, a basis function is defined.

A basis function B over the 5-knot set K can be defined
recursively as follows:

$$B(u \mid K) = a_0 B(u \mid K \setminus v_0) + a_1 B(u \mid K \setminus v_1) + a_2 B(u \mid K \setminus v_2)$$

where u is any point in the 2-D domain, a_0, a_1, a_2 are
related to the barycentric coordinates of u w.r.t. any
three knots v_0, v_1, v_2; K\v_i is the 5-knot set K minus v_i, i.e.,
a 4-knot set. When K\v_i reduces to only 3 knots, a
zeroth-degree B-spline basis function is obtained, which
is a flat unit-height triangle:

$$B(u \mid v_0, v_1, v_2) = \begin{cases} 1 & \text{if } u \in [v_0, v_1, v_2) \\ 0 & \text{otherwise} \end{cases}$$

It can be shown that three collinear knots will result in a
crease edge while four collinear knots will produce a
step edge. If the nine knots are pulled apart and no collinear
knots exist at all, each B-spline basis function
automatically becomes C¹. Linear combination of the
six basis functions gives rise to a local patch, and over the
whole domain triangulation (T triangles) a surface is
obtained:

$$X(u) = \sum_{t=0}^{T-1} \sum_{k=0}^{5} c_{t,k} \, B_{t,k}(u)$$

where X is a point {x,y,z} in 3D space, and the c_{t,k}'s are the
scaling factors to the basis functions, acting as control points
(adjacent triangles may share 3D control points to guarantee
C⁰, or share their xy components but not the z component
to allow a step edge). If a weight w_{t,k} is associated
with each control point, the above triangular B-spline surface
is extended to a triangular NURBS surface:

$$X(u) = \frac{\sum_{t=0}^{T-1} \sum_{k=0}^{5} c_{t,k} \, w_{t,k} \, B_{t,k}(u)}{\sum_{t=0}^{T-1} \sum_{k=0}^{5} w_{t,k} \, B_{t,k}(u)}$$

Note that if all the weights are equal, the triangular
NURBS specialize to triangular B-splines; furthermore,
if the three pulled-apart knots collapse to be duplicate at
each vertex, then the overlap/blend effect among adjacent
basis functions disappears, and thus triangular
B-splines degenerate to triangular Bezier-splines, and the
automatic continuity property is lost.
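
The recursive definition of the basis function given earlier in this section can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the authors' code: it uses the commonly cited simplex-spline normalization in which the three-knot base case has height 1/|det| (twice the triangle area in the denominator) instead of unit height, takes the a_i to be exactly the barycentric coordinates with respect to the first non-degenerate knot triple, and treats triangles as closed rather than half-open. The names `signed_det`, `barycentric`, and `simplex_spline` are hypothetical.

```python
from itertools import combinations

def signed_det(a, b, c):
    """Twice the signed area of triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def barycentric(u, tri):
    """Barycentric coordinates of u with respect to a non-degenerate triangle."""
    a, b, c = tri
    d = signed_det(a, b, c)
    return (signed_det(u, b, c) / d, signed_det(a, u, c) / d, signed_det(a, b, u) / d)

def simplex_spline(u, knots, eps=1e-12):
    """Evaluate a bivariate simplex spline at u by knot-removal recursion."""
    knots = list(knots)
    if len(knots) == 3:
        # Base case: piecewise-constant function supported on the triangle.
        d = signed_det(*knots)
        if abs(d) < eps:
            return 0.0
        inside = all(lam >= 0.0 for lam in barycentric(u, knots))
        return 1.0 / abs(d) if inside else 0.0
    # Pick the first three knots that span a non-degenerate triangle.
    for idx in combinations(range(len(knots)), 3):
        tri = [knots[i] for i in idx]
        if abs(signed_det(*tri)) > eps:
            break
    else:
        return 0.0  # all knots collinear: no 2D support
    total = 0.0
    for lam, i in zip(barycentric(u, tri), idx):
        total += lam * simplex_spline(u, knots[:i] + knots[i + 1:], eps)
    return total

# Degree-1 example on the four corners of the unit square.
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
value = simplex_spline((0.5, 0.25), square)
```

With five knots in general position the same function evaluates a quadratic basis function. Under this normalization the result does not depend on which non-degenerate triple is chosen, which is a convenient sanity check.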

If the above domain is a triangulation of the surface of a
unit sphere, a closed surface is defined. The only differences
are that the sum of the three spherical barycentric
coordinates is no longer 1 (it is usually greater than
1), and that three or four knots on the same great circle
will produce a discontinuity. Such a spherical representation
is very useful in geographical and other applications.
Since the splines are defined over arbitrary
triangulations of the sphere, there are no pole artifacts as
in the TP splines, and the continuity is automatically
guaranteed. The simplest case is the tetrahedron tessellation
of the unit sphere, in which as few as four triangles
suffice to exactly model a sphere. By contrast, to
cover a sphere with rectangles, at least six rectangles
(the faces of a cube) are required to obtain the continuity,
and a set of constraint equations has to be maintained.
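
The remark about spherical barycentric coordinates can be checked numerically. The sketch below assumes the usual definition, in which a unit vector u is written as u = b0*v0 + b1*v1 + b2*v2 for the three unit vectors spanning a spherical triangle; the function name is hypothetical.

```python
import numpy as np

def spherical_barycentric(u, v0, v1, v2):
    """Solve u = b0*v0 + b1*v1 + b2*v2 for points on the unit sphere."""
    V = np.column_stack([v0, v1, v2])
    return np.linalg.solve(V, np.asarray(u, dtype=float))

# Spherical triangle spanned by the three coordinate axes (one octant).
e0, e1, e2 = np.eye(3)
center = np.ones(3) / np.sqrt(3.0)   # unit vector in the middle of the octant

b = spherical_barycentric(center, e0, e1, e2)
total = b.sum()                       # sqrt(3), i.e. greater than 1
```

At a vertex the coordinates reduce to (1, 0, 0) and sum to exactly 1; in the interior of the spherical triangle the sum exceeds 1.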

3  Surface  Reconstruction 

3.1  Inferring  Potential  Fields 

The  goal  of  the  first  stage  is  to  infer  dense  potential  mea¬ 
sures  from  the  scattered  data,  as  described  in  the  flow 
chart  in  Figure  1.  The  input  can  be  points,  segments,  or 
patches. In our work, we only use points. The details
of the method can be found in Guy and Medioni's
paper [3].

The  idea  is  to  locally  enforce  the  general  constraints, 
which  are  co-surfacity,  proximity,  and  constancy  of  cur¬ 
vature.  These  constraints  are  encoded  into  a  3D  vector 
mask.  Such  a  mask,  when  aligned  with  an  input  data  site, 
associates  a  preferred  direction  and  strength  to  every 
voxel  in  a  large  volume  of  space  around  the  input  site.  By 
aligning  the  field  with  each  input  site,  we  produce,  at 
each  voxel  location,  a  collection  of  vector  votes.  This 
voting  information  is  then  compressed  into  the  second 
order  moments  by  the  covariance  matrix,  graphically 
represented  by  an  ellipsoid,  or  equivalently,  by  three 
eigen-vectors. The eigen-values λ_max, λ_mid, λ_min are
interpreted as three saliency measures for surfaces, edges and
junctions,  and  the  eigen-vectors  are  used  to  estimate  the 
surface  normals. 

In more detail, λ_max - λ_mid is used as the saliency of a
surface passing through a location: if λ_max - λ_mid is large,
then λ_max is large and λ_mid is small, and λ_min is also small
(since λ_min ≤ λ_mid). Thus, there is only one strong vote group here,
i.e., the consistency of votes at this location is high. In
other words, the probability of a real surface passing
through this location is high. Similarly, λ_mid - λ_min is used
as an edge saliency measure, and λ_min is used as the junction
saliency measure.
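
As a rough sketch of this interpretation step, the three saliencies at one voxel can be computed from the votes collected there. The code assumes the votes are summarized by an uncentered 3x3 second-order moment matrix (sum of outer products); the function name and the toy votes are illustrative, and the voting mask itself (see [3]) is not modeled.

```python
import numpy as np

def saliencies(votes):
    """Return (surface, edge, junction) saliency from an (N, 3) array of vote vectors."""
    votes = np.asarray(votes, dtype=float)
    moments = votes.T @ votes                      # 3x3 second-order moment matrix
    lam_min, lam_mid, lam_max = np.linalg.eigvalsh(moments)  # ascending order
    return lam_max - lam_mid, lam_mid - lam_min, lam_min

# All votes agree on one direction: a single strong group, high surface saliency.
surface_like = saliencies([[0, 0, 1]] * 4)

# Votes split between two directions: high edge saliency.
edge_like = saliencies([[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0]])

# Votes spread over three directions: high junction saliency.
junction_like = saliencies([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
```

Negating each returned value gives the corresponding potential, so salient locations become potential minima.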


Note that this voting methodology imposes no restriction
on the number of objects, the genus (topology), or the number of
discontinuities, and the algorithm is non-iterative and
efficient. By negating the above three saliency measures,
we  obtain  the  potential  fields  for  surfaces,  edges  and 
junctions.  The  minimum  potential  locations  of  the  three 
potential  fields  indicate  the  existence  of  surfaces,  edges 
and  junctions,  respectively. 


Input: oriented patches, points, curves
  -> Pre-processing: estimating normals, estimating likelihood
  -> Oriented patches
  -> Vote accumulation: vector convolution and combination;
     encodes proximity, co-surfacity, constancy of curvature
  -> Dense moments map
  -> Vote interpretation: computing saliency maps from eigenvalues
     and eigenvectors of the voting central moments
  -> Output: surface saliency map, curve saliency map, junction saliency map

Figure 1. Flow chart of the grouping stage


The  above  voting  procedures  are  performed  for  each 
voxel  in  a  3D  grid,  and  they  can  work  for  any  number  of 
surfaces of arbitrary topology. They even work for
non-manifolds. The voting complexity is O(n³k) in general,
where  n  is  the  side  size  of  the  volume  grid,  and  k  is  the 
number  of  data  points. 
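
As a rough worked example, reading the voting cost as proportional to the number of voxels times the number of data points, and plugging in the 50x50x50 grid and 400 sample points reported for the pyramid experiment in Section 4 (constant factors ignored):

```latex
n^3 k = 50^3 \times 400 = 125{,}000 \times 400 = 5 \times 10^{7} \ \text{vote evaluations}
```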

3.2  Edge-aligning  Surface  Deformation 

From  the  surface  potential  field,  a  triangular  mesh  can 
be  traced  out  by  the  “marching  cube”  algorithm  [25]. 
Different  from  the  standard  marching  cube  algorithm, 
the  surface  to  be  traced  out  is  now  the  minimum-poten¬ 
tial  surface  or  zero-derivative  surface  of  the  potential 
field,  instead  of  the  iso-surface  of  the  potential  values. 
Similarly,  the  discontinuity  curves  and  junctions  can  be 
traced  out  by  marching  methods  also.  However,  the 
mesh  is  quite  dense  with  each  facet  being  a  triangle;  the 
edges  and  junctions  are  traced  out  from  separate  fields 
and  are  not  integrated  into  the  surfaces. 

Our goal is to obtain a sparse curved surface representation,
with the discontinuity edges and junctions
preserved in the surfaces. We drop a triangular spline
surface  into  the  potential  fields,  and  deform  the  surface 
to  reach  minimum  energy.  Different  from  previous 
work,  we  couple  the  deformable  surface  model  with  the 
active  edge  model,  so  that  a  fitted  surface  together  with 
aligned  edges  will  be  obtained,  and  the  result  gives  a 
compact  integrated  representation  from  the  three  poten¬ 
tial  fields.  Since  each  surface  patch  can  deform  and  its 
edges  can  slide,  we  call  our  model  “winged  B-snakes”. 

We define the energy E for the winged B-snakes as follows:

$$E = E_{surfaces} + E_{edges} + E_{junctions} = E_{s\text{-}smooth} + w_1 E_{s\text{-}fit} + w_2 E_{e\text{-}align} + w_3 E_{j\text{-}align}$$

where w_1, w_2, and w_3 are coefficients that balance the
effects of the three types of energy. They are chosen by
normalizing the above four terms to 1 : 20 : 5 : 5
before the start of minimization. They bring the four
terms to comparable orders of magnitude, and the
minimization results are stable under variations of these
coefficients. We adjust the control points to minimize the total energy.
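
One way to read the normalization step is sketched below: given the four energy terms evaluated on the initial surface, pick the coefficients so that the weighted terms start out in the stated 1 : 20 : 5 : 5 proportion. The function name, its argument names, and the sample energy values are assumptions for illustration; the sketch also assumes all initial energies are non-zero.

```python
def balance_weights(e_smooth, e_fit, e_edge, e_junction,
                    ratios=(1.0, 20.0, 5.0, 5.0)):
    """Return (w1, w2, w3) so the initial weighted terms follow `ratios`."""
    r_smooth, r_fit, r_edge, r_junction = ratios
    unit = e_smooth / r_smooth          # size of "one part" of the ratio
    w1 = r_fit * unit / e_fit
    w2 = r_edge * unit / e_edge
    w3 = r_junction * unit / e_junction
    return w1, w2, w3

# Hypothetical initial energies for smoothness, fit, edge and junction terms.
w1, w2, w3 = balance_weights(2.0, 8.0, 4.0, 0.5)
weighted = (2.0, w1 * 8.0, w2 * 4.0, w3 * 0.5)   # (2, 40, 10, 10), i.e. 1:20:5:5
```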


(1)  Surface  Smoothness  Energy 

Although each triangular patch is a quadratic polynomial
which is always continuous, we still need the
smoothness energy to minimize the mesh roughness.
The smoothness energy is defined in terms of the
first-order right-hand derivatives:

$$E_{s\text{-}smooth} = \sum_{t,X} \iint_{u,v} \left( X_u^2 + X_v^2 \right) \, du \, dv$$

where the summation is over all triangles t and coordinates X in (x,y,z),
and the integration is over the barycentric coordinates
(u,v) within a triangle. Since the goal of this stage is
mainly surface fitting, we simply use the above membrane
energy. The more costly thin-plate energy based
on curvature or second-order derivatives is used in the
final subtle modification and fairing stage.
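
The difference between the membrane and thin-plate energies can be illustrated on a simpler stand-in. The sketch below deliberately swaps the triangular spline for a height field z(x, y) on a regular grid and uses finite differences, so it is only an analogy to the energies used in the paper; the function names and grid are assumptions.

```python
import numpy as np

def membrane_energy(z, h=1.0):
    """Sum of squared first derivatives of a height grid (finite differences)."""
    zy, zx = np.gradient(z, h)
    return float(np.sum(zx ** 2 + zy ** 2) * h * h)

def thin_plate_energy(z, h=1.0):
    """Sum of squared second derivatives of a height grid (finite differences)."""
    zy, zx = np.gradient(z, h)
    zyy, _ = np.gradient(zy, h)
    zxy, zxx = np.gradient(zx, h)
    return float(np.sum(zxx ** 2 + 2.0 * zxy ** 2 + zyy ** 2) * h * h)

yy, xx = np.mgrid[0:5, 0:5].astype(float)
plane = 2.0 * xx + 3.0 * yy        # tilted but perfectly flat
bowl = xx ** 2                      # curved in x

plane_membrane = membrane_energy(plane)      # penalizes the tilt
plane_thin_plate = thin_plate_energy(plane)  # a flat plane has no bending
bowl_thin_plate = thin_plate_energy(bowl)    # bending is penalized
```

The membrane term only needs first derivatives, which is why it is the cheaper choice during fitting, while the thin-plate term responds to curvature and suits the final fairing.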


(2)  Surface  Fitting  Energy 

Inside  the  surface  potential  field,  the  surface  patches 
(wings)  flap  to  reach  the  local  minimum.  If  the  triangle 
edges  cross-over  the  actual  discontinuity  edges,  the  sur¬ 
face  energy  becomes  large;  when  the  edges  move  to 
align  with  the  actual  edges,  the  surface  energy  is 
reduced.  However,  we  found  that  the  surface  energy 
makes  the  edges  move  a  little  bit,  but  is  not  strong 
enough  to  pull  them  to  exactly  align  with  the  actual  dis¬ 
continuity  edges.  This  is  why  we  shall  introduce  an 
explicit  edge  alignment  energy  to  make  the  edges 
become  “active”  by  themselves. 

(3)  Edge  and  Junction  Alignment  Energy 


Inside  the  edge  potential  field,  the  triangle  edges 
(snakes)  can  slide  to  the  local  minimum  so  that  the  trian¬ 
gle  edges  may  align  with  the  actual  discontinuity  edges, 
instead  of  crossing  over  them.  Also,  the  triangle  vertices 
can  move  to  the  local  minimum  in  the  junction  potential 
field,  making  the  vertices  align  with  the  inferred  junc¬ 
tions.  Such  alignment  significantly  reduces  the  total 
energy  or  cooperatively  improves  the  surface  fitting  pre¬ 
cision,  and  we  thus  do  not  have  to  subdivide  the  trian¬ 
gles  into  many  tiny  ones  to  obtain  a  good  fitting  due  to 
the  misaligned  edges  and  junctions  as  in  most  previous 
methods. 
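
A minimal sketch of how an edge alignment term might be evaluated: sample the edge potential field along a straight segment between two vertices and average the values. The nearest-voxel sampling, the straight-segment simplification, the sample count, and the synthetic potential field are all assumptions made for illustration.

```python
import numpy as np

def edge_energy(potential, p0, p1, samples=11):
    """Mean edge potential sampled (nearest voxel) along the segment p0 -> p1."""
    p0 = np.asarray(p0, dtype=float)
    p1 = np.asarray(p1, dtype=float)
    t = np.linspace(0.0, 1.0, samples)[:, None]
    pts = np.rint(p0 + t * (p1 - p0)).astype(int)
    return float(np.mean(potential[pts[:, 0], pts[:, 1], pts[:, 2]]))

# Synthetic edge potential: a low-potential line along the first grid axis.
potential = np.zeros((8, 8, 8))
potential[:, 4, 4] = -1.0

aligned = edge_energy(potential, (0, 4, 4), (7, 4, 4))      # lies on the line
misaligned = edge_energy(potential, (0, 2, 4), (7, 2, 4))   # parallel, two voxels off
```

An edge that slides onto the low-potential line lowers its energy, which is the behavior the minimization exploits.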

3.3  Obtaining  Surfaces  with  Creases 

After  the  surface  patches  have  been  fitted  to  the  data, 
and  the  edges  and  junctions  have  aligned  with  actual 
ones,  the  detection  of  discontinuity  edges  and  junctions 
from  the  surface  is  straightforward  [8].  We  simply  need 
to  check  the  value/derivative  differences  along  the 
boundary  between  every  pair  of  adjacent  triangles.  We 
normalize  the  differences  by  the  total  area  of  the  pair  of 
triangles.  Such  locally  adaptive  thresholding  works 
well.  We  then  check  the  change  of  angles  between  adja¬ 
cent  edge  segments  along  the  detected  edges,  and  those 
vertices with sharp changes are marked as corners, and
those with more than two neighbors are marked as junctions.
To build the C¹ smooth surface while preserving
the  edges  and  junctions,  we  pull  apart  the  knots  in  the 
continuous  regions,  and  allow  knots  to  be  duplicate  or 
collinear/co-circular  to  respect  edges/junctions  (at  first 
all  three  knots  are  duplicate  at  each  vertex).  With  the 
new  knot  configuration,  we  adjust  the  control  points 
once  more. 
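
The corner and junction marking described above can be sketched as a small graph pass over the detected edge segments. The 30-degree threshold, the label names, and the toy edge graph are assumptions; the paper does not state the threshold it uses.

```python
import math

def turning_angle_deg(p, v, q):
    """Angle between the incoming direction p->v and the outgoing direction v->q."""
    d_in = [v[i] - p[i] for i in range(3)]
    d_out = [q[i] - v[i] for i in range(3)]
    dot = sum(a * b for a, b in zip(d_in, d_out))
    norm = math.sqrt(sum(a * a for a in d_in)) * math.sqrt(sum(b * b for b in d_out))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def classify_vertices(points, edges, angle_threshold_deg=30.0):
    """Label each vertex on the detected edges as junction, corner, smooth or end."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    labels = {}
    for v, nbrs in neighbors.items():
        if len(nbrs) > 2:
            labels[v] = "junction"
        elif len(nbrs) == 2:
            p, q = sorted(nbrs)
            angle = turning_angle_deg(points[p], points[v], points[q])
            labels[v] = "corner" if angle > angle_threshold_deg else "smooth"
        else:
            labels[v] = "end"
    return labels

# Toy edge graph: a straight run, a right-angle turn, and a branch.
points = {0: (0, 0, 0), 1: (1, 0, 0), 2: (2, 0, 0),
          3: (2, 1, 0), 4: (1, 1, 0), 5: (2, 2, 0)}
edges = [(0, 1), (1, 2), (2, 3), (1, 4), (3, 5)]
labels = classify_vertices(points, edges)
```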

3.4  Fine-tuning/Fairing  by  Adjusting 
Weights 

In the above fitting and alignment procedures, we never
adjust the weights. The reasons are that we want to
exploit the descriptive power of control points as much
as possible, and that there is much redundancy in the weights
(scaling of weights does not change the shape at all), so
leaking information to the weights is not desired. Also,
adjusting too many parameters is difficult for the
minimization routines. When we tested adjusting the weights at the
same time as moving the control points, the deformation
and alignment slowed down from 10 minutes to 30 minutes;
more memory is also needed for the Hessian
matrix in the minimization routine. Due to the c_{t,k} · w_{t,k}
multiplication terms, the minimization becomes non-linear.
After some experimental tests, we found that if the
initial guess is very good, simultaneous adjustment
results in a little improvement (about 1~5% further
reduction of the total energy), and that if the initial guess
is not good, the residual energy may even be larger!
Including knots in the minimization makes the situation
even worse, since the knots are buried in the quadratic
basis functions. In summary, it seems that adjusting
control points and weights simultaneously is not always
better, or is not worth the cost, even if it can bring slight
improvement.
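
The redundancy in the weights is easy to see numerically: a rational combination is unchanged when every weight is multiplied by the same factor. The sketch below evaluates a single point of such a combination; the control points, weights, and basis values are made-up numbers, and `rational_point` is a hypothetical helper rather than part of the authors' system.

```python
import numpy as np

def rational_point(control_points, weights, basis_values):
    """Weighted rational combination: sum(c*w*B) / sum(w*B)."""
    c = np.asarray(control_points, dtype=float)
    wb = np.asarray(weights, dtype=float) * np.asarray(basis_values, dtype=float)
    return (wb[:, None] * c).sum(axis=0) / wb.sum()

control = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
basis = [0.2, 0.3, 0.5]                    # basis function values at some u
weights = np.array([1.0, 2.0, 0.5])

p_original = rational_point(control, weights, basis)
p_scaled = rational_point(control, 3.0 * weights, basis)   # same point
p_uniform = rational_point(control, [1.0, 1.0, 1.0], basis)
```

Only the ratios between weights matter, so one degree of freedom in the weight vector carries no shape information.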

We use the thin-plate energy as the surface fairness measure,
which is based on the second-order derivatives.
The energy for fine-tuning and fairing is defined as

$$E = E_{surface\text{-}fairness} + c \cdot E_{surface\text{-}fit}$$

where the coefficient c is chosen such that the two terms
are normalized to 1 : 5 before the minimization starts.
Since the surface-fitting has been done in stage 3 where
the smoothness/fitting ratio is 1 : 20, we can now use a
smaller ratio (1 : 5). To minimize this energy, only small
changes of weights are necessary, and we did not
observe negative weights in our experiments.
Initially all weights are 1; later they stay between 0.5
and 10. We did not use a negative weight penalty term,
such as one based on min(0, w_{t,k}) [22]-[24].

4  Experiments 

Figure  2(a)  is  the  randomly  sampled  data  (400  points)  of 
a truncated pyramid; (b), (c), and (d) display the three
volumetric potential fields for surfaces, edges and junctions,
respectively  (only  voxels  with  potentials  below  a  thresh¬ 
old  are  displayed).  This  stage  runs  for  about  10  minutes 
on  a  SUN  Sparc  10,  using  three  50x50x50  arrays. 

To  obtain  an  initial  surface,  we  regularly  split  the  bound¬ 
ing  box  of  the  data  points  into  rectangles  and  then  split 
each  rectangle  into  two  triangles  along  a  diagonal. 
Which  diagonal  is  chosen  depends  on  which  one  yields 
smaller  surface  fitting  energy  within  this  rectangle.  Such 
an  initial  surface  is  shown  in  Figure  2(e)  with  misaligned 
edges  and  junctions. 

Due to the model's local control property, a control
point only affects a triangle and its surrounding triangles,
so it is reasonable to expect that local minimization
need not sacrifice global optimality. In our
implementation, we adjust only a triangle and all its
neighboring triangles at a time, and then move to the
next triangle. We use the Levenberg-Marquardt algorithm
with numerically estimated gradients. Figure 2(f)
shows aligned/detected edges and junctions after adjusting
the control points; (g) is the C¹ surface with edges
and junctions preserved by pulling apart or setting
collinear knots and then adjusting the control points again.
Finally, (h) gives the result after the fine-tuning and fairing
by adjusting the weights. We can see that the edges
in (h) are sharper than in (g), which indicates the effect of
adjusting the weights for final-stage subtle improvement.
The three stages for reconstruction from the
potential fields run for about 10 minutes on an SGI/Indigo.

Figure 2. Open pyramid


To  test  the  triangular  B-splines  defined  over  a  unit 
sphere,  we  add  some  samples  at  the  bottom  of  the  pyra¬ 
mid  to  make  it  a  closed  surface,  as  shown  in  Figure  3(a). 
A  triangular  tessellation  of  the  unit  sphere  is  given  in  (b) 
by  subdividing  and  flipping  the  diagonals  on  the  faces  of 
a  unit  cube.  The  initial  surface  is  specified  by  a  sphere 
enclosing  the  data  set.  After  the  surface  fitting  and  edge 
alignment, the C¹ result with preserved edges and junctions
is shown in (c). Then, (d) gives the final result after
fine-tuning and fairing by adjusting the weights. Every
triangle  is  treated  equally,  and  no  pole  artifact  regions 
exist  at  all.  This  demonstrates  the  major  advantage  of 
triangular  splines  over  rectangular  tensor-product 
splines  for  modeling  spherical  data.  Previous  work 
using  rectangular  splines  had  to  tolerate  the  pole  degen¬ 
eracy,  or  use  two  or  more  pieces  of  surfaces  and  then 
glue  them  together. 


Figure 3. Closed pyramid


Figure 4. An airplane (open surface)


Figure  4(a)  shows  900  scattered  samples  of  an  airplane. 
We  apply  a  Delaunay  triangulation  algorithm  to  the  scat¬ 
tered  data  and  produce  the  initial  triangular  mesh  (b),  in 
which  some  triangle  edges  severely  misalign  at  the 
boundary  between  the  wing  and  the  body  as  seen  from 
the  zoomed  display  in  (d).  After  surface  fitting  and  edge 
alignment,  the  edges  of  the  spline  triangles  align  with 
discontinuity  curves  in  the  data  and  can  be  detected  as 
shown  in  (e).  Finally,  fine-tuning  and  fairing  is  per¬ 
formed  to  give  the  result  in  (c).  Note  that  the  whole  sur¬ 
face  is  a  single  quadratic  spline  mesh.  By  contrast,  if 
TP-splines  were  used,  we  would  have  to  trim  a  rectan¬ 
gular  bi-quadratic  (quartic)  surface  into  the  airplane 
shape,  or  stitch-up  several  surface  pieces;  both  methods 
are  tedious. 

Figure 5. A tooth (closed shape)


Figure  5(a)  shows  the  intensity  image  of  a  plaster  tooth 
and  (b)  shows  650  points  measured  on  it.  We  regularly 
subdivide a sphere into 720 triangles as the initial surface.
A top view of the final C¹ surface with detected
and  preserved  edges  is  given  in  (c).  Another  view  is 
given  in  (d).  In  dental  CAD/CAM,  preserving  the  sharp 
edges  on  a  tooth  is  very  important,  otherwise  an  upper 
tooth  cannot  align  tightly  with  the  lower  tooth;  on  the 
other  hand,  the  sides  of  the  tooth  must  be  smooth,  or 
else  the  machined  crown  cannot  be  put  on  a  specific 
patient’s  tooth.  Our  winged  B-snakes  represented  with 
triangular  B-splines  seem  very  promising  for  smooth 
surfaces  with  embedded  discontinuities.  We  are  working 
on  merging  the  triangles  in  smooth  areas  for  model  sim¬ 
plification;  also  we  are  starting  to  work  on  more  compli¬ 
cated  objects,  such  as  mechanical  parts  and  medical  CT/ 
MRI  data  of  human  brains  and  organs. 


Figure 6. A banana


The  final  example  is  a  banana.  Figure  6  (a)  is  the  sam¬ 
pled  data  (about  300  points).  A  planar  triangle  region  is 
used  as  an  initial  surface,  and  the  final  surface  with 
detected and preserved crease edge is shown in (b).

5 Discussion

In this paper we only describe open surface and spherical
surface modeling from scattered data. Compared
with tensor-product splines, triangular splines offer
more flexibility. Without triangular B-splines, the surface
reconstruction and embedded edge representation
would become very complicated. We are in the process
of extending the scheme to arbitrary topology surfaces:
first trace out the topology information (a dense triangular
mesh) from the potential volume fields using the
Marching-cube algorithm [25], then perform mesh
reduction and use the simplified mesh as an initial surface.
After surface-fitting/edge-alignment with the
winged B-snake paradigm, construct a $G^1$ overlap/blend
surface with preserved discontinuities [26] [27]. Local
subdivision/merging and multiresolution representations
will be used for compact representation and efficient
reconstruction.

6  Conclusion 

In  this  paper  we  have  proposed  a  new  scheme  for  recon¬ 
structing  surfaces  from  sparse  noisy  scattered  data  that 
may  contain  unspecified  discontinuity  edges  and  junc¬ 
tions.  At  first,  we  use  a  vector  voting  technique  to  infer 
dense  surface/edge/junction  potential  information.  Then 
we  drop  a  quadratic  triangular  B-spline  deformable  sur¬ 
face  coupled  with  active  edges  in  the  inferred  potential 
fields.  After  some  energy  minimization  iterations  by 
adjusting  the  control  points,  the  discontinuity  edges  and 
junctions  are  automatically  aligned  and  detected,  and 
then preserved by setting the knot configurations in constructing
the final $C^1$ smooth surface. For tasks with high
precision  requirements,  fine-tuning  and  fairing  the  sur¬ 
face  by  further  adjusting  the  weights  in  the  triangular 
NURBS is also effective.

Abstract

We  describe  a  simple  extensible  framework  for 
interactively  delineating  image  curves  based  on 
user  input.  Like  snake  methods,  it  is  based  on 
the  optimization  of  a  potential  function  defined 
over  the  image,  and  is  thus  adaptable  to  a  wide 
range  of  linear  features.  Unlike  snakes,  how¬ 
ever,  it  is  based  on  the  computation  of  a  glob¬ 
ally  optimal  path  between  designated  points. 
Thus  results  are  stable,  repeatable  and  indepen¬ 
dent  of  the  dynamical  properties  necessary  to 
ensure  proper  snake  convergence.  We  will  also 
demonstrate  a  variety  of  ways  in  which  the  dy¬ 
namic  programming  delineation  (DPD)  may  be 
extended  to  deal  with  other  than  purely  linear 
features. 

1  Introduction 

Supporting user-guided, semi-automatic extraction
of features and editing automatically extracted
features from images has become a primary
focus of IU research in recent years. One
of the most successful of such methodologies
is “snakes,” in which a smooth, approximate
curve is optimized to align with a designated linear
image feature [Kass et al., 1987]. However,
even  though  snakes  have  gained  very  wide  ac¬ 
ceptance,  they  have  significant  problems  which 
tend  to  boil  down  to  the  observation  that  final 
results  are  often  not  robustly  related  to  the  in¬ 
put  parameters  (e.g.  initial  curve  estimates  or 
snake  dynamics).  This  is  a  common  criticism 
of  gradient  descent  methods,  which  are  vulner¬ 
able  to  being  trapped  by  local  minima.  Snakes 
are especially vulnerable because of their high
dimensionality.

An  alternate,  but  related  methodology  has  re¬ 
cently  been  rediscovered  which  replaces  the  gra¬ 
dient  descent  optimization  with  a  dynamic  pro¬ 
gramming  method.  Simply  put,  the  image  is 
considered  as  a  planar  graph  with  natural  adja¬ 
cencies  and  the  minimal  cost  path  is  computed 
from  one  pixel  to  another  using  a  dynamic  pro¬ 
gramming  algorithm.  First  described  in  this 
form as part of an automatic system for extracting
road networks from overhead imagery [Fischler
et al., 1981], it was initially solved with a
multi-pass  algorithm  named  FSTAR.  The  prin¬ 
ciple  was  later  rediscovered  and  recast  in  an  in¬ 
teractive  form  in  [Mortensen  and  Barrett,  1995], 
in  which  a  variant  of  Dijkstra’s  minimal  cost 
path  algorithm  was  used.  Largely  because  of 
the  thousandfold  increase  in  available  desktop 
computational  power,  what  was  an  off-line  tech¬ 
nology  in  1981  had  become  a  fast,  interactive 
technique  in  1995. 

Subsequent research demonstrated an isomorphism
with numerical methods for solving certain
initial value PDEs [Cohen and Kimmel,
1996] and this revealed useful means of constraining
the extracted curves (e.g. by bounding
curvature). In addition, a search for more
explicit  means  of  constraining  curvature  demon¬ 
strated  a  way  of  extending  the  graphs  beyond 
a  simple  image  isomorphism  to  explicitly  rep¬ 
resent  other  properties  [Merlet  and  Zerubia, 
1996]. 

We  present  a  formalization  of  these  related 
methods  which  reveals  the  richness  of  this 
paradigm.  By  considering  the  similarities  and 
differences  between  the  alternatives  already  ex¬ 
plored  we  propose  a  number  of  innovative  uses 
of  this  technology,  and  avenues  for  further  ex¬ 
ploration. 

2  General  Methodology 

Dynamic  programming  delineation  (DPD)  on 
images begins by defining an isomorphism between
an image and a graph in which each pixel
is associated with a vertex and the transitions
between neighbouring pixels are associated with
graph edges. For a discrete image $I$
we define a graph $G_I = (V, E)$ such that there
is a one-to-one mapping from image pixels to
graph vertices

$$V = \{v_i \mid i \in I\}.$$

The graph’s edges correspond to pixel adjacencies
in the image

$$E = \{e_{ij} \mid j \in \mathrm{Neigh}(i)\}.$$

Given this graph, we formulate the delineation
problem as the solution to a minimal path problem
given two points $i_0, i_1 \in I$ and a non-negative
potential function $P : E \to \Re^+$. The
straightforward solution to this problem is a dynamic
programming algorithm known as Dijkstra’s
minimal-path algorithm [Cormen et al.,
1990, p. 527]. It manages four data structures:
$S$, the set of vertices for which we know the minimal
cost path from $v_0$ to $v$; an array $C[v] \in \Re$,
representing this cost; $Q = \{(v, Q_v)\}$, a priority
queue of vertices ordered by the partial
path cost $Q_v$; and the array $\pi[v] \in V$ such that
$\pi[v] = u$ means that $u$ precedes $v$ on the minimal
path to $v$. The algorithm is then as outlined in
Fig. 1.


Dijkstra(G, v_0, v_1):
    S ← ∅
    Q ← {(v_0, 0)}
    foreach v ∈ V:
        C[v] ← ∞
        π[v] ← nil
    loop
        let (u, Q_u) ← EXTRACT-MIN(Q)
        S ← S ∪ {u}
        C[u] ← Q_u
        foreach v ∈ {v | e_uv ∈ E}:
            let d_v ← C[u] + P(e_uv)
            if (v, Q_v) ∉ Q or d_v < Q_v then
                Q ← Q ∪ (v, d_v)
                π[v] ← u
    while u ≠ v_1.

Figure  1:  Dijkstra’s  minimal  path  algorithm. 

If this algorithm is then adjusted so that the
cost used to order the queue is augmented by a
lower-bound estimate of the cost from $v$ to the goal
position $v_1$ (typically a constant multiple of the
distance $\|v_1 - v\|$), then the process is biased
toward the goal and the algorithm is called A*
[Duda and Hart, 1973]. Using a traditional binary
heap, this algorithm is $O(E \log V)$.
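
As an illustration only, the following Python sketch runs the minimal-path search on a small pixel grid. It assumes a four-connected neighbourhood and a cost paid on entering each pixel; the names `neighbours`, `minimal_path` and `h_weight` are hypothetical and not taken from the paper. Unlike Figure 1, the sketch skips vertices that are already in $S$, and with `h_weight = 0` it reduces to plain Dijkstra.

```python
import heapq
import math


def neighbours(shape, v):
    """Yield the 4-connected pixel neighbours of v = (row, col)."""
    rows, cols = shape
    r, c = v
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            yield (rr, cc)


def minimal_path(cost, v0, v1, h_weight=0.0):
    """Return (path, total_cost) from pixel v0 to pixel v1.

    cost[r][c] is the non-negative cost of stepping onto pixel (r, c).
    h_weight = 0 gives Dijkstra. A positive h_weight no larger than the
    smallest pixel cost gives A* with a lower-bound distance estimate.
    """
    shape = (len(cost), len(cost[0]))
    C = {v0: 0.0}      # best known path cost per vertex
    pi = {v0: None}    # predecessor on the best known path
    S = set()          # vertices whose minimal cost is final
    Q = [(0.0, v0)]    # priority queue ordered by cost (plus estimate)
    while Q:
        _, u = heapq.heappop(Q)
        if u in S:
            continue   # stale queue entry
        S.add(u)
        if u == v1:
            break
        for v in neighbours(shape, u):
            if v in S:
                continue
            d = C[u] + cost[v[0]][v[1]]
            if d < C.get(v, math.inf):
                C[v] = d
                pi[v] = u
                estimate = h_weight * math.dist(v, v1)
                heapq.heappush(Q, (d + estimate, v))
    # Backtrack through pi from the goal to the start.
    path = []
    v = v1
    while v is not None:
        path.append(v)
        v = pi[v]
    return path[::-1], C[v1]
```

On a grid whose middle row is expensive except for one cheap pixel, `minimal_path(cost, (0, 0), (2, 0))` detours through the cheap pixel rather than crossing the expensive one, and the A* variant returns a path of the same cost while typically settling fewer vertices.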

Since  DPD  is  an  interactive  methodology,  we 
frame  this  algorithm  in  a  structure  which  re¬ 
flects  point  selection  and  motions.  In  the  ba¬ 
sic  form,  the  user  initiates  the  process  with  a 
mouse click on the image that selects $v_0$. As the
pointer moves, the dynamic programming algorithm
is run from $v_0$ to $v_1$, the current mouse
position. The display is then updated by drawing
a curve over the image from $v_1$ to $v_0$ by
backtracking through $\pi$. Rather than reinitializing
the process throughout, $S$, $Q$, $C$, and $\pi$ are
retained as long as $v_0$ is constant (the costs in
$Q$ need to be readjusted if A* is used). Thus if
$v_1 \in S$, we need not run Dijkstra at all, instead
simply drawing the curve. A subsequent mouse
click then freezes the location of $v_1$ and saves
the  path. 
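
The reuse of $S$, $Q$, $C$ and $\pi$ between pointer motions can be sketched as below. This is a minimal illustration under stated assumptions (four-connected grid, pixel costs, no A* bias); the class name `DelineationSession` and its methods are hypothetical. The search is only advanced when the requested pixel has not yet been settled.

```python
import heapq
import math


class DelineationSession:
    """Keeps the search state for as long as the start pixel v0 is fixed."""

    def __init__(self, cost, v0):
        self.cost = cost
        self.rows, self.cols = len(cost), len(cost[0])
        self.C = {v0: 0.0}     # minimal known cost from v0
        self.pi = {v0: None}   # predecessor table
        self.S = set()         # settled vertices
        self.Q = [(0.0, v0)]   # pending vertices ordered by cost

    def _expand_until(self, v1):
        """Advance the search only as far as needed to settle v1."""
        while v1 not in self.S and self.Q:
            d, u = heapq.heappop(self.Q)
            if u in self.S:
                continue       # stale queue entry
            self.S.add(u)
            r, c = u
            for v in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                inside = 0 <= v[0] < self.rows and 0 <= v[1] < self.cols
                if not inside or v in self.S:
                    continue
                dv = d + self.cost[v[0]][v[1]]
                if dv < self.C.get(v, math.inf):
                    self.C[v] = dv
                    self.pi[v] = u
                    heapq.heappush(self.Q, (dv, v))

    def path_to(self, v1):
        """Return the curve from v1 back to v0 by backtracking through pi."""
        self._expand_until(v1)
        path = []
        v = v1
        while v is not None:
            path.append(v)
            v = self.pi[v]
        return path
```

Each pointer motion would call `path_to` with the current mouse pixel; a new start click would create a fresh session.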

To  this  point,  we  have  assumed  only  that  the 
cost $P(e_{uv})$ is an arbitrary non-negative traversal
cost for the edge $e_{uv}$. In order to actually
select particular image features, we need to specialize
this cost. In its simplest form, we can
simply reduce this to a cost function for the
terminal pixel of the edge, which we’ll designate
$P(v)$. In this special case, if we use four-connected
neighbourhoods for $\mathrm{Neigh}(v)$ we are
left with a method similar to the Fast Marching
Method (introduced by Sethian in [Sethian,
1996]) for solving for $C$ in the initial value problem

$$\|\nabla C\| = P \quad \text{where } C[v_0] = 0.$$

The minimal path we construct by backtracking
along $\pi$ is then a discrete approximation to the
curve $\mathcal{C}$ from $v_0$ to $v_1$ for which the path
integral $\int_{\mathcal{C}} P \, ds$ is minimal.

Cohen and Kimmel [Cohen and Kimmel, 1996]
used this formulation to prove that if $P$ is of the
form $P(v) = \tilde{P}(v) + w$, for $\tilde{P}(v)$ positive and $w$ a
positive constant, then the curve generated has
the property that

$$|\kappa| < \frac{\sup\{\|\nabla P\|\}}{w}$$

where $\kappa$ is the curvature of the curve $\mathcal{C}$. Thus $w$ is a
regularizing parameter for the curves extracted.

3 Applications

Given the restricted form of image potentials
defined above, we have the ability to select for
a wide range of image features. For example:

$$P(v) = |I(v) - I_0| + w$$

will select for curves that follow paths of intensity
$I_0$. Thus setting $I_0 = \sup\{I\}$ or $I_0 = \inf\{I\}$,
DPD will extract bright or dark lines respectively.
This is remarkably effective for road extraction
on low-resolution imagery (see Fig. 2).

If we are interested in edges, a simple image
transform will be effective, with

$$P(v) = |\nabla^2 I(v)| + w,$$

tracking the zeroes of the Laplacian, the basic
building block of the classic Marr-Hildreth edge
operator [Marr and Hildreth, 1980]. However,
this formulation has the disadvantage of being
insensitive to edge orientation. We would usually
prefer to select curves which have a well-defined
edge orientation. The key to achieving
this goal is the realization that $P$ is really a
function of the edge $e_{uv}$. For oriented edges
then we can simply consider the cross-section
of $I$ perpendicular to $uv$, which we’ll call $I^{\perp}_{uv}$.
A simple function which selects rising edges is
then

$$P(uv) = d_{\max} - dI^{\perp}_{uv}$$

where $d_{\max} = \sup\{dI^{\perp}_{uv}\}$. Following the reasoning
outlined in [Iverson and Zucker, 1995], this
can be made even more selective for edges by
combining with other derivatives, for example

$$P(uv) = d_{\max} - \ldots + w,$$

where $d_{\max} = \sup\{\ldots\}$.

DPD  can  also,  perhaps  less  obviously,  be  used 
as  a  higher-level  linking  method  for  some  other 
low-level  feature  detector.  This  approach  was 
adopted  in  a  project  in  which  the  goal  was  to 
detect  and  delineate  rivers  and  streams  from 
overhead  imagery  [Fua,  1996].  A  multi-baseline 
stereo  algorithm  was  used  to  reconstruct  de¬ 
tailed  elevation  maps  over  a  selected  image. 
An  automatic  delineation  method  [Fischler  and 
Wolf,  1983]  was  then  used  to  select  those  points 
M  in  the  image  at  which  linear  structure  was 
evident.  This  bitmap  was  then  used  as  a  mask 
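for terrain curvature data $\kappa$ to construct a cost
image

$$P(v) = \begin{cases} K_m - \kappa(v) & \text{if } \exists\, m \in M : \|v - m\| < \epsilon; \\ K & \text{otherwise,} \end{cases}$$

where $K_m = \sup\{\kappa(v)\}$, $K > K_m$ and $\epsilon$ is a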
small  mask  dilation  (e.g.  3  pixels).  As  can 
be  seen  in  Fig.  3,  the  DPD  process  has  the 
effect  of  linking  together  selected  points  in  an 
intuitive  fashion  while  simultaneously  choosing 
maximal  curvature  points  within  the  masked  re¬ 
gions.  Thus  we  have  a  method  for  effectively 
and intuitively combining a logical feature selection
method with a scaled measure.

Figure 2: Low resolution road extraction using simple DPD with $I_0 = 1$. Delineation of the entire
road network took approximately 10 seconds and 10 mouse clicks.
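
The pixel potentials of this section are simple to compute. The sketch below is an illustration under assumptions, not the authors' implementation: images are plain nested lists, the mask $M$ is given as a list of pixel coordinates, and the function names `intensity_potential` and `masked_cost` are hypothetical. It covers the intensity potential $|I(v) - I_0| + w$ and a masked cost image that is low near mask points with high curvature and equal to a large constant elsewhere.

```python
def intensity_potential(image, i0, w):
    """P(v) = |I(v) - I0| + w for every pixel v of a nested-list image."""
    return [[abs(p - i0) + w for p in row] for row in image]


def masked_cost(kappa, mask_points, eps, k_high):
    """Cost image built from curvature data and a linear-feature mask.

    kappa       : nested list of curvature values
    mask_points : list of (row, col) pixels selected by the detector
    eps         : mask dilation radius in pixels
    k_high      : constant cost outside the dilated mask; it should
                  exceed the largest curvature value
    """
    k_max = max(max(row) for row in kappa)
    out = []
    for r, row in enumerate(kappa):
        out_row = []
        for c, k in enumerate(row):
            # Pixel is inside the dilated mask if some mask point is
            # closer than eps (compared on squared distances).
            near = any((r - mr) ** 2 + (c - mc) ** 2 < eps ** 2
                       for mr, mc in mask_points)
            out_row.append(k_max - k if near else k_high)
        out.append(out_row)
    return out
```

Choosing `i0` as the maximum or minimum image value selects bright or dark lines respectively, and either cost image can be passed to a minimal-path search unchanged.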


4  Extended  Graphs 

So  far  we  have  considered  only  situations  in 
which  there  was  a  perfect  isomorphism  between 
the  input  image  and  the  graph  on  which  we  are 
seeking  optimal  paths.  One  way  in  which  we 
can  loosen  the  constraints  on  the  nature  of  the 
image  features  we  select  for  (and  thus  enrich  our 
representational  ability)  is  to  allow  for  extension 
of  the  graph. 

This  idea  was  first  introduced  in  [Merlet  and 
Zerubia,  1996],  in  which  an  explicit  curvature 
constraint  was  imposed  by  associating  eight 
graph  vertices  with  each  image  pixel,  one  for 
each  direction  used  to  enter  the  pixel.  Thus, 
using the same format as above, $G'_I = (V', E')$
such that there is a mapping from image pixels
to graph vertices

$$V' = \{v_{i\lambda} \mid i \in I,\ \lambda \in L\}$$

where $L = \{1, \ldots, n\}$ is a set of labels, for some
integer $n$. The graph’s edges correspond to local
neighbourhoods in the vertex set

$$E' = \{e_{uv} \mid u \in V' \wedge v \in \mathrm{Neigh}'(u)\}.$$

In [Merlet and Zerubia, 1996], the neighbourhoods
were restricted so that the edges from
pixel $i$ to $j \in \mathrm{Neigh}(i)$ connect to vertex $v_{j\lambda}$
where $\lambda$ was the direction of the path $ij$. Furthermore,
$e_{uv}$ only existed if the direction implicitly
specified by vertex $u$ differs by less than 90 degrees
from the direction of $e_{uv}$. In this
way,  they  were  able  to  guarantee  that  paths 
with  local  bends  greater  than  45  degrees  were 
impossible.  As  we  noted  above,  the  same  con¬ 
dition  can  be  imposed  with  an  appropriate  con¬ 
stant  w  added  to  the  potential  P,  but  the  prin¬ 
ciple  exposed  by  Merlet  and  Zerubia  is  of  more 
general  usefulness. 

For  example,  if  we  wish  to  extend  the  road  ex¬ 
traction  problem  suggested  earlier  to  higher  res¬ 
olution  imagery,  then  we  will  have  to  deal  with 
the  problem  of  road  width.  One  way  to  ex¬ 
plicitly  capture  the  width  of  the  road  at  every 
point  along  its  length  is  to  introduce  a  label  set 
$L = \{1, \ldots, n\}$ where the label $\lambda \in L$ specifies
a road of width $2^{\lambda/2}$. Since roads have reasonably
continuous width, we will restrict the
graph edges so that the edge $e_{uv}$ only exists for
$u = (i, \lambda_i)$ and $v = (j, \lambda_j)$ when $j \in \mathrm{Neigh}(i)$
and $|\lambda_i - \lambda_j| \le 1$. Thus, even though the extended
graph $G'_I$ now has $n$ times as many vertices
as $G_I$, there are only three times as many
edges in $E'$, and we should still be able to run
the DPD process at interactive rates. Moreover,
if the potentials $P(uv)$ are calculated with respect
to a pre-computed multi-resolution image
pyramid,  then  the  evaluation  of  the  potential 
can be restricted to a small local neighbourhood
in the pyramid by simply selecting the appropriate
pyramid level given $\lambda$. Thus, the system
will find the minimal cost path through a scale
space representation of roads.

Figure 3: River delineation for three creeks flowing down a slope (a). Note that using pure
curvature (b) for the delineation potential P fails to locate certain creekbeds (c). If
instead the curvature is masked with a linear feature mask (d), then the delineation is
accurate (e). Each creekbed was generated with two mouse clicks.
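
A search over such an extended graph can be sketched as follows. This is only an illustration under assumptions: vertices are (pixel, label) pairs, pixels are four-connected, labels may change by at most one per step, and the potential is supplied by the caller as a hypothetical function `potential(pixel, label)`. The start pixel is allowed any label at no cost, and the search stops at the first settled vertex on the goal pixel.

```python
import heapq
import math


def extended_minimal_path(shape, n_labels, potential, start, goal):
    """Minimal path over vertices (pixel, label) with labels 1..n_labels.

    An edge joins (i, a) to (j, b) only when j is a 4-neighbour of i and
    |a - b| <= 1. Returns (path, total_cost).
    """
    rows, cols = shape
    C, pi, S, Q = {}, {}, set(), []
    for lam in range(1, n_labels + 1):
        u = (start, lam)
        C[u] = 0.0
        pi[u] = None
        heapq.heappush(Q, (0.0, u))
    while Q:
        d, u = heapq.heappop(Q)
        if u in S:
            continue  # stale queue entry
        S.add(u)
        (r, c), lam = u
        if (r, c) == goal:
            path = []
            while u is not None:
                path.append(u)
                u = pi[u]
            return path[::-1], d
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= rr < rows and 0 <= cc < cols):
                continue
            for lam2 in (lam - 1, lam, lam + 1):
                if not 1 <= lam2 <= n_labels:
                    continue
                v = ((rr, cc), lam2)
                if v in S:
                    continue
                dv = d + potential((rr, cc), lam2)
                if dv < C.get(v, math.inf):
                    C[v] = dv
                    pi[v] = u
                    heapq.heappush(Q, (dv, v))
    return None, math.inf
```

In a road-width setting, `potential` would score how well a road of the width implied by the label fits the image at that pixel; the label constraint then forces the recovered width to vary gradually along the path.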

5  Conclusion 

We  have  described  a  notable  body  of  new  re¬ 
search  which  has  enabled  a  significant  improve¬ 
ment  in  the  ability  of  analysts  and  others  work¬ 
ing  with  images  to  define  and  interactively  ex¬ 
tract  linear  image  structures.  Because  the 
methodology  uses  a  fast,  global  optimization 
method,  it  is  significantly  more  stable  than  pop¬ 
ular  alternatives  like  snakes.  Even  when  re¬ 
sults  are  inadequate,  the  fast  interactive  frame¬ 
work  allows  for  the  kind  of  immediate  interac¬ 
tion  which  is  necessary  to  get  around  these  kinds 
of  problems. 

This  technology  is  in  its  infancy  and  we  have 
outlined  a  few  directions  for  future  growth. 
Even  in  its  current  form,  it  is  clearly  useful  and 
should  find  its  place  in  standard  image  analysis 
and manipulation toolkits.

Abstract

This  paper  is  primarily  concerned  with  the 
problem  of  finding  a  single  perceptually  obvi¬ 
ous  path  (POP)  in  an  image;  e.g.,  an  isolated 
road  in  an  overhead  view  of  a  desert  scene,  or 
a  particular  line,  drawn  on  a  piece  of  paper, 
that  a  person  points  at.  We  briefly  describe 
those  relevant  parts  of  a  system  designed  to  ad¬ 
dress  the  general  problem  of  automatically  de¬ 
lineating  line-like  structures,  but  focus  on  the 
perceptual,  semantic,  and  computational  issues 
relevant  to  this  particular  problem. 

1  Introduction 

This  paper  is  primarily  concerned  with  the 
problem  of  finding  a  single  perceptually  obvious 
path  (POP)  in  an  image  (or  selected  image  win¬ 
dow);  e.g.,  an  isolated  road  in  an  overhead  view 
of  a  desert  scene,  or  a  particular  line,  drawn  on 
a  piece  of  paper,  that  a  person  points  at. 

In  reference  [5]  we  described  an  architecture 
(Figure  1:  LD  block  diagram)  that  we  have 
found to be both general and effective for addressing
the delineation problem; it involves the
following  subsystems  and  processes: 

(a) Detector/Binarizer Subsystem. Binarization
of the gray-level image retaining the perceptual
saliency of the linear structures (e.g.,
Figure 2b or 3b);

(b)  Generic  Linear  Delineation  Subsystem.  Par¬ 
titioning  and  linking  the  binary  markers  into  a 
collection  of  independent  (perceptually  obvious) 
generic open paths (e.g., Figure Ic, 2c).

(c)  Semantic  Linear  Delineation  Subsystem. 
Splitting,  semantic  filtering,  and  relinking  the 
generic  (perceptually  salient)  paths  to  obtain 
semantically  significant  delineations.  Our  goal 
here  could  be  to  find  a  collection  of  independent 
paths  (open,  closed,  or  both),  a  linked  network 
(with  or  without  explicit  path  extraction),  or  to 
find a single “best” path.

We  briefly  describe  relevant  parts  of  the  above 
system,  but  focus  on  the  perceptual,  semantic, 
and  computational  issues  relevant  to  this  par¬ 
ticular  problem,  especially  the  Semantic  Delin¬ 
eation  Subsystem  and  our  proposed  solution  for 
the  final  process  -  relinking  a  subset  of  the  fil¬ 
tered  line  segments  into  a  single  POP. 

2 Overall Rationale and Main
Problems  to  be  Solved 

Given  two  curves,  we  typically  have  no  precise 
quantitative  procedure  for  determining  which 
of  the  two  would  be  more  perceptually  salient 
to a normal human observer. By perceptually
salient we mean that, after a very brief inspection,
one alternative would be chosen over the
other  as  depicting  the  presence  of  some  inter¬ 
esting/important  natural  or  man-made  feature, 
or  coherent  structure,  in  an  image  of  a  natural 
scene;  or  likely  to  be  found  first  if  both  were 
judged  to  be  coherent;  or  judged  to  be  a  bet¬ 
ter  exemplar  of  some  semantic  category.  The 
available  ranking  criterion  for  generic  curves  is 
largely  limited  to  the  qualitative  Gestalt  laws 
of perceptual organization [1] - proximity, closure,
simplicity, similarity, good continuation
(smoothness),  and  symmetry.  In  addition  we 
might  have  some  semantic  or  physical  con¬ 
straints  that  could  be  used  to  disqualify  a  curve 
from being a member of a target semantic category.
For example, a curve that “doubled-back”
on  itself  (i.e.,  was  multi-valued  for  azimuth) 
would  typically  not  be  a  valid  skyline  for  an  im¬ 
age  taken  with  a  horizontally  held  camera;  or, 
an  isolated  closed  curve  in  an  aerial  image  would 
be  unlikely  to  depict  an  interstate  freeway.  In 
general,  at  this  time,  we  can  only  be  expected 
to  make  gross  judgments  in  our  ranking  -  e.g., 
to  find  a  best  (perceptually  salient)  curve  when 
there  is  really  only  one  viable  candidate  in  the 
search  area. 

Because  of  occlusions  and  background  struc¬ 
ture,  there  generally  is  no  simple  way  to  parti¬ 
tion  the  image  into  curves,  associated  with  co¬ 
herent  objects,  that  are  complete  and  have  no 
contamination  by  extraneous  background  con¬ 
tent.  If  we  tried  to  list  or  assemble  all  pos¬ 
sible  curves  prior  to  ranking  them,  an  image 
with  as  few  as  20-40  curve-points  would  be  com¬ 
putationally  impractical  to  process  because  of 
the  factorial  growth  in  the  possible  number  of 
curves.  What  is  implied  by  the  above  consider¬ 
ations  is  that  a  single  step  solution  to  the  prob¬ 
lem  of  selecting  a  single  most  salient  curve  is 
probably  not  attainable;  we  must  perform  a  se¬ 
quence  of  grouping,  filtering,  and  information- 
reduction  steps  to  eliminate  unlikely  candidates 
as  early  in  the  selection  process  as  possible,  and 
then  make  our  final  selection  on  a  greatly  simpli¬ 
fied  reduction  of  the  originally  presented  data. 

We  have  examined  two  distinct  approaches  to 
the  delineation  problem  in  general,  and  to  find¬ 
ing  the  POP  in  particular:  (a)  Dynamic  Pro¬ 


gramming  (DP)  [2]  which  is  capable  of  finding  a 
least-cost  path  in  a  real- valued  2-D  array  (which 
could  be  the  original  picture,  or  some  derived 
overlay called a “cost” image), and (b) a number
of  graph-theoretic  techniques  which,  in  practice, 
require  an  early  binarization  of  the  input  image. 

DP,  or  any  other  global  optimization  technique 
that  can  operate  on  the  actual  input  data  be¬ 
comes  computationally  infeasible  for  anything 
other  than  cost/objective  functions  that  are 
very “local” in nature. That is, the cost of a path
going  through  a  particular  pixel  in  an  image 
should  only  be  a  function  of  an  attribute  list 
attached  to  that  pixel  and  (say)  the  cost  of  ap¬ 
pending  the  given  pixel  to  a  path  that  passes 
through  an  adjacent  pixel  -  rather  than  being 
dependent  on  (say)  the  specific  positioning  of 
the  previous  five  pixels  in  the  curve  segment  to 
which  attachment  is  being  considered.  Thus, 
the  nominal  generality  of  full  global  optimiza¬ 
tion  is  not  really  attainable  because  of  compu¬ 
tational  considerations.  Even  if  we  could  con¬ 
tend  with  the  computational  difficulties,  there 
is  the  further  problem  of  actually  specifying 
the  global  cost/objective  function  that  approx¬ 
imately  models  human  perceptual  behavior  in 
interpreting  graylevel  images  -  this  is  an  even 
more  difficult  unsolved  problem. 

In  the  approach  we  will  now  discuss,  we  have 
found  (through  a  combination  of  theory  and 
experiment  -  but  this  is  primarily  an  empir¬ 
ical  result)  that  it  is  possible  to  automati¬ 
cally  construct  a  binary  overlay,  of  almost  any 
non-contrived  graylevel  image,  that  will  retain 
the  perceptual  saliency  of  the  linear  structures 
(paths).  It  is  further  the  case  that  it  is  now  (in 
a  binary  image)  possible  to  define  the  primary 
cues  that  underlie  our  perception  of  a  line  or 
path:  relative  proximity  and  smoothness  of  the 
binary (1 or 0) pixels defining the line/path. Although
not a traditional Gestalt property, persistence
(e.g., coherent path length) is also a cue
of  major  importance;  the  other  Gestalt  cues 
play  a  (sometimes  dominant)  role  only  when 
there  is  ambiguity  due  to  contending  interpreta¬ 
tions,  or  when  we  recognize  some  known  shape 
or  repeated  structure. 

Generic (perceptual rather than application dependent)
clustering and linking are effectively
(but  not  perfectly)  achieved  by  employing  a 
modified  Minimum  Spanning  Tree  (MST)  al¬ 
gorithm  with  a  bound  on  inter-point  distance. 
The  MST  algorithm  we  devised  for  this  purpose 
can  be  made  to  run  in  time  proportional  to  the 
number  of  points  being  processed  (because  the 
points  are  represented  by  bounded  integer  coor¬ 
dinates,  their  density  is  not  arbitrary). 
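
The idea of a spanning forest with a bound on inter-point distance can be illustrated with a short sketch. Note the substitution: this is ordinary Kruskal's algorithm with a union-find structure over all point pairs, which is not the linear-time algorithm referred to above; the function name `bounded_mst_forest` is hypothetical. Pairs farther apart than `max_dist` are never linked, so distant clusters end up in separate trees.

```python
import math
from itertools import combinations


def bounded_mst_forest(points, max_dist):
    """Return the edges (i, j) of a minimum spanning forest of `points`
    in which no edge is longer than `max_dist`."""
    parent = list(range(len(points)))

    def find(i):
        # Find the set representative, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Candidate edges: all point pairs within the distance bound,
    # processed from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) <= max_dist
    )
    forest = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # the edge joins two different trees
            parent[ri] = rj
            forest.append((i, j))
    return forest
```

For binarized image points the quadratic pair enumeration would normally be replaced by a grid-bucket neighbour search, since only nearby pairs can pass the bound.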

The  result  of  the  above  steps  is  a  collection  of 
disjoint  MST’s  which  can  be  separately  parsed 
to provide a collection of line-segments
(RPATHS)  as  the  final  output  of  the  generic 
linking  component  of  our  system.  This  pars¬ 
ing  process  involves  (1)  finding  a  primary  path 
through  the  tree  (typically  a  diameter  path), 

(2)  trimming-back  branches  with  ragged  ends, 

(3)  pruning  short  branches,  (4)  partitioning  the 
remaining  collection  of  branches  into  disjoint 
paths  which  are  pair-wise  linked  at  the  MST 
nodes according to geometric and (original-image)
intensity smoothness criteria. An example
showing the result of this process is presented
in Figures 2c and 3c.
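
Step (1) of the parsing process, finding a diameter path through a tree, can be sketched with two breadth-first searches. This is a standard technique offered as an illustration, assuming path length is counted in edges rather than in image distance; the names `farthest` and `diameter_path` are hypothetical.

```python
from collections import defaultdict, deque


def farthest(adj, start):
    """Breadth-first search; return a farthest vertex and the parent map."""
    parent = {start: None}
    queue = deque([start])
    last = start
    while queue:
        last = queue.popleft()   # vertices leave the queue in depth order
        for nxt in adj[last]:
            if nxt not in parent:
                parent[nxt] = last
                queue.append(nxt)
    return last, parent


def diameter_path(edges):
    """Return the vertices of a longest path in a tree given as edges."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    a, _ = farthest(adj, edges[0][0])   # one end of a diameter
    b, parent = farthest(adj, a)        # the opposite end
    path = []
    while b is not None:
        path.append(b)
        b = parent[b]
    return path
```

The branches hanging off the returned path are the candidates for the trimming and pruning steps (2) and (3).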

2.1  On  the  Combinatorics  of  Finding 
A  Perceptually  Obvious  Path 

Assume  that  we  start  with  a  binarized  image 
depicting  a  single  POP.  If  we  had  a  criterion 
function  (CF)  that  allowed  us  to  rank  alterna¬ 
tive  POP  candidates,  we  observe  that  the  naive 
solution  of  generating  and  ranking  all  possible 
paths is computationally infeasible for any realistic
problem. Since there are $n!$ possible paths
on $n$ points, and $20! > 10^{18}$, a problem with as
few  as  20  points  would  be  impossible  to  solve 
this  way. 
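
The size of this search space is easy to check numerically. In the sketch below the evaluation rate of one billion paths per second is an assumption chosen purely for illustration, and the function names are hypothetical.

```python
import math


def orderings(n):
    """Number of distinct orderings (candidate paths) of n points."""
    return math.factorial(n)


def years_to_enumerate(n, paths_per_second=1e9):
    """Years needed to visit every ordering at an assumed evaluation rate."""
    seconds = orderings(n) / paths_per_second
    return seconds / (365.25 * 24 * 3600)


print(orderings(20))  # 2432902008176640000
```

At the assumed rate, exhaustively ranking the orderings of 20 points would take several decades, and each additional point multiplies the work by the new point count.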

In  general,  we  must  address  two  sub-problems: 

(1)  selecting/partitioning  the  actual  path- 
points  from  the  set  of  potential  path-points,  and 

(2)  sequencing  the  selected  path-points.  Let  us 
assume  that  we  are  given  the  points  that  ac¬ 
tually  constitute  the  solution  (POP).  A  very 
reasonable  CF,  based  on  the  primary  Gestalt 
property  of  proximity,  is  density  (number- 
of-path-points/path-length);  i.e.,  we  want  to 
find  the  shortest  path  that  contains  all  the 


given/selected  points.  What  we  have  just  es¬ 
tablished  is  that  a  simplification  (sub-problem) 
of  our  original  problem  is  the  Traveling  Sales¬ 
man  Problem  (TSP)  if  the  POP  is  closed,  or 
the problem of finding a “Messenger” (open)
path. Both the TSP and the “Messenger” path
problem  are  known  to  be  computationally  in¬ 
tractable  for  large  values  of  n  (NP-hard).  For 
example,  (at  least)  until  recently,  the  largest 
value  of  n  for  which  there  is  a  known  solution  to 
a  non-contrived  TSP  was  318  cities  [7][6].  While 
there are fast methods for finding an approximation
to the solution of a Euclidean TSP,
the perceptual character of such a solution
is  uncertain. 
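
For a handful of points, the density criterion and the open-path problem can be written out directly. The sketch below is a brute-force illustration under assumptions (Euclidean distances, hypothetical function names) and is usable only for very small point sets, which is precisely the limitation discussed above.

```python
import math
from itertools import permutations


def path_length(path):
    """Total Euclidean length of an open path through the given points."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def density(path):
    """Criterion function: number of path points per unit path length."""
    return len(path) / path_length(path)


def messenger_path(points):
    """Shortest open path through all points, by exhaustive search.

    Examines every ordering, so the cost grows factorially with the
    number of points.
    """
    return min(permutations(points), key=path_length)
```

Maximizing `density` over orderings of a fixed point set is the same as minimizing `path_length`, which is why the open-path problem appears as a sub-problem.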

It is clear that in order to solve the POP problem
we must strictly limit the number of
points that can be arbitrarily sequenced, or we
must limit the number of choices that are the
possible successors of any given point, or use
some combination of the two preceding constraints.
In a variety of problem domains
that we have been concerned with (e.g., finding
roads in aerial images, recognizing trees
and/or finding the skyline in natural ground-level
scenes), we have observed that we can
usually find very dense path segments that are
longer than some minimal length (related to visual
detection criteria), and place perceptual
and/or  application-domain-related  constraints 
on  linking  possibilities  for  these  dense  segments. 
To  the  extent  that  most  of  the  path-points  are 
already  sequenced  as  members  of  the  detected 
segments,  and  it  is  only  the  segments  that  must 
be  sequenced,  and  even  here  there  are  only  a  few 
linking  alternatives  for  each  of  the  segments,  we 
can  solve  the  POP  problem  even  though  it  is 
formally  intractable. 

Our overall approach, then, is [3][5]:

(1)  assemble  the  potential  path-points  into 
dense  segments  by  using  a  fast  MST  algorithm 
(although  the  MST  does  not  actually  assure  the 
densest  connectivity,  it  usually  provides  a  very 
good  approximation  to  this  condition).  The  in¬ 
put  to  this  step  is  a  binarized  image;  the  output 
is  a  forest  of  (collection  of  disjoint)  MST’s. 

(2)  recover  the  longest  segments  -  consistent 
with generic perceptual connectivity criteria -
that can be extracted from the forest of trees
generated in step (1). (The list containing these
segments is called RPATHS.)

(3)  repartition  and  semantically  filter  the  col¬ 
lection  of  RPATHS  to  eliminate  perceptual  and 
semantic  linking  mistakes  and  irrelevant  paths 
introduced  or  retained  by  the  limited  flexibility 
of  the  MST  algorithm/representation  and  the 
generic  parsing  process. 

(4)  Use  a  very  general  linking  technique  and  rep¬ 
resentation  schema,  capable  of  expressing  arbi¬ 
trary  perceptual  and  semantic  constraints,  to 
imply  a  network  of  paths  that  is  very  likely  to 
include  the  POP. 

(5)  Parse  the  network  produced  in  (4)  to  extract 
a  relatively  small  collection  of  prominent  paths 
that  includes  the  POP. 

(6)  Rank  the  paths  extracted  in  (5),  using  an 
objective  function  based  on  the  primary  Gestalt 
criterion,  and  return  the  highest  ranked  path  as 
the  POP. 

In  the  next  section,  we  discuss  some  of  the  de¬ 
tails  of  how  the  Semantic  Delineation  Subsys¬ 
tem  (Figure  1)  accomplishes  steps  3  through  6. 

3  The  Semantic  Delineation 
Subsystem 

The Semantic Delineation Subsystem is composed
of two major components: the Semantic
Filter and the Semantic Linker. The Semantic
Linker, in turn, has three main functional elements:
(a) the SL-Segment-Linker, (b) the SL-Path-Generator,
and (c) the POP-Generator.

3.1  The  Semantic  Filter  (SF) 

The  purpose  of  the  semantic  filter  is  to  extract, 
from  a  collection  of  perceptually  salient  paths, 
those  sub-paths  that  are  compatible  with  the 
constraints  of  some  specified  application  or  pur¬ 
pose  (e.g.,  sub-paths  that  could  be  road  seg¬ 
ments  in  an  aerial  image). 

This system component takes as its input a list
of generic perceptually-salient paths (RPATHS)
and produces, as its output, a list of path-segments
(RPATHS-F). Each item (called a
seg) in RPATHS-F is a coherent sub-path of
some path in RPATHS; the segs returned in
RPATHS-F are open and non-self-intersecting,
and any pair of segs are disjoint with the possible
exception of a single intersection-point (as
are the paths in RPATHS).

The SF processes each path in RPATHS independently.
It first partitions the path into adjacent
segs at its salient points using the algorithm
described in [4]. This partitioning step is
necessary to recover components of the
application-relevant paths that were combined with
other (incidental) adjacent paths in the original
image. Each seg is evaluated for compatibility
with  the  constraints  of  the  intended  application 
on  an  accept  or  reject  basis.  The  accepted  segs 
are  appended  to  the  output-list  RPATHS-F.  In 
addition,  if  two  accepted  segs  were  part  of  the 
same  input  (RPATHS)  path,  but  are  now  sepa¬ 
rated  in  the  sense  that  some  portion  of  the  in¬ 
put  path  between  them  was  deleted  by  the  filter, 
then  an  entry  recording  this  fact  is  made  on  a 
link-list  (see  discussion  of  the  Semantic  Linker). 

While the SF might have to be completely
redesigned for each new application, we have
found that the same set of attributes (properly
parameterized for the different applications)
appears adequate for such diverse tasks as finding
roads or rivers in aerial images, and for finding
man-made objects (e.g., building edges) or
natural objects (e.g., the skyline, tree-trunks)
in ground-level images.

The attributes we currently evaluate (to be
described in detail in a later version of this paper)
are concerned with length, directionality,
smoothness, and degree of randomness:

(1) Length. Very short segs are typically rejected
as being “noise” or unimportant (they
can be recovered later if necessary); very long
segments are typically accepted since they are
too important to discard without the further
analysis to be performed later.

(2)  Consistency  of  global  direction  based  on  a 
histogram  of  the  directions  between  adjacent 
seg  pixels  obtained  from  a  chain-coded  repre¬ 
sentation  of  the  seg. 

(3) Smoothness. This property is measured in
two  ways.  First,  each  seg  is  inherently  smooth 
to  some  degree  because  its  parent  in  RPATHS 
was  partitioned  into  segments  at  salient  (or  high 
curvature)  points.  Thus,  the  length  of  the  seg 
is  an  indirect  measure  of  its  smoothness  (the 
longer  the  seg,  the  smoother  it  is).  Second,  we 
measure  the  seg’s  deviation  from  a  best  fitting 
circular-arc  to  look  for  a  smoothness  property 
that  is  especially  important  for  some  applica¬ 
tions  (e.g.,  finding  man-made  objects). 

(4) Randomness. We have devised a weak measure
of symmetry, or of repeated structure, in
a path; this measure, together with the evaluation
of coherent length, consistent direction,
and smoothness, provides a basis for judging
whether a seg is a “purposeful” or an apparently
random structure.
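
The length and direction attributes lend themselves to a short illustration. The following is only a sketch under stated assumptions, not the filter we actually use: it assumes a seg is a list of 8-connected (x, y) pixels, and the function names, the three-bin direction window, and both thresholds are hypothetical. The smoothness and randomness tests are omitted.

```python
from collections import Counter

# 8-connected chain-code directions: index -> (dx, dy). Assumed convention.
CHAIN_DIRS = [(1, 0), (1, 1), (0, 1), (-1, 1),
              (-1, 0), (-1, -1), (0, -1), (1, -1)]


def chain_code(seg):
    """Chain code of a seg given as a list of 8-connected (x, y) pixels."""
    codes = []
    for (x0, y0), (x1, y1) in zip(seg, seg[1:]):
        codes.append(CHAIN_DIRS.index((x1 - x0, y1 - y0)))
    return codes


def direction_consistency(seg):
    """Fraction of steps that fall in the best window of three adjacent
    bins of the 8-bin direction histogram (1.0 = one consistent heading)."""
    codes = chain_code(seg)
    if not codes:
        return 0.0
    hist = Counter(codes)
    best = max(
        sum(hist[(d + k) % 8] for k in (-1, 0, 1)) for d in range(8)
    )
    return best / len(codes)


def accept_seg(seg, min_len=10, min_consistency=0.8):
    """Accept/reject on length and direction only (hypothetical thresholds)."""
    return len(seg) >= min_len and direction_consistency(seg) >= min_consistency
```

A straight run of pixels scores 1.0 on direction consistency, while a seg that turns through several right angles scores much lower and would be rejected under these thresholds.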

An example of the performance of a semantic
filter we designed for delineating roads in aerial
images is shown in Figures 2d and 3d. Tables
1 and 2, in the section on experimental evaluation,
present quantitative results of the filtering
operation in terms of a relevant set of semantic
categories.

3.2  The  Semantic  Linker  (SL) 

The purpose of the semantic linker is either to
combine all the segs in the list RPATHS-F (produced
by the Semantic Filter) into a network of
partitioned or unpartitioned paths, or to select
and sequence a subset of the segs in RPATHS-F
into a single POP. The problem of producing
an unpartitioned network (generally, the more
useful of the available types of output, since a
distinguished POP might not even exist) is a
very simple sub-problem of producing a POP.

The SL has three components: (a) the SL-Segment-Linker,
(b) the SL-Path-Generator,
and (c) the POP-Generator.

3.2.1  The  SL-Segment-Linker 
(SLSL) 

The input to SLSL is RPATHS-F, and its output
is the “link-pair-list.” The SLSL examines
every pair of segs in RPATHS-F and determines
if they can be adjacent components of an extended
path compatible with the constraints of
the specified application. If so, it generates
a “link-pair” entry which is appended to the
“link-pair-list.”

The  SLSL  typically  uses  three  types  of  criteria 
to  make  a  link  decision  for  a  pair  of  segs: 

(1)  The  relative  geometric  positioning  and  sep¬ 
aration  of  the  segs.  For  example,  in  the  case 
of  road  delineation,  the  criterion  is  typically  a 
bound  on  the  separation-distance  between  nom¬ 
inally  corresponding  endpoints  (one  on  each 
seg).  In  the  case  of  skyline  delineation,  the  segs 
might  be  further  constrained  not  to  have  any 
overlap  in  their  horizontal  (x)  coordinates. 

(2)  Global  attributes  of  the  segs.  For  example, 
in  the  case  of  road  delineation  we  might  require 
that  the  spectral  distribution,  or  image  inten¬ 
sity,  or  mean  width  of  the  two  candidate  segs 
be  identical  to  within  some  specified  tolerance. 

(3)  Acceptance  by  the  semantic  filter.  If  the 
two  candidate  segs  are  linked  as  proposed  and 
treated  as  a  single  seg,  a  sufficient  condition  for 
linking  is  that  the  combination  is  accepted  by 
the  semantic  filter. 
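
Criterion (1), in its simplest form for road delineation, can be sketched as a bound on the gap between the endpoints that a proposed link would join. This is an illustration only: the function names and the `max_gap` parameter are hypothetical, a seg is assumed to be an ordered list of (x, y) points, and a link-pair is written as a pair of (seg-index, reversed) tuples with Python `True`/`False` standing in for T/NIL. Criteria (2) and (3) are not modeled.

```python
import math
from itertools import combinations


def oriented_ends(seg, reverse):
    """(first, last) point of a seg after optional reversal."""
    return (seg[-1], seg[0]) if reverse else (seg[0], seg[-1])


def link_pairs_by_distance(segs, max_gap):
    """Return link-pairs ((i, rev_i), (j, rev_j)) for every pair of segs
    whose joined endpoints lie within max_gap pixels of each other."""
    pairs = []
    for i, j in combinations(range(len(segs)), 2):
        for rev_i in (False, True):
            for rev_j in (False, True):
                # The link joins the trailing end of seg i to the
                # leading end of seg j, each in its chosen orientation.
                _, tail_i = oriented_ends(segs[i], rev_i)
                head_j, _ = oriented_ends(segs[j], rev_j)
                if math.dist(tail_i, head_j) <= max_gap:
                    pairs.append(((i, rev_i), (j, rev_j)))
    return pairs
```

Trying all four orientation combinations is what lets a single distance test express which end of each seg takes part in the link.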

3.2.2  The  SL-Path-Generator 
(SLPG) 

The input to the SLPG is the “link-pair-list”
produced  by  the  SLSL  and  augmented  by  addi¬ 
tional  link-pairs  supplied  by  the  Semantic  Fil¬ 
ter;  the  output  is  either  an  unsegmented  net¬ 
work  (actually,  a  disjoint  collection  of  such  net¬ 
works)  or  a  pair  of  lists  containing  all  possible 
maximal  open-paths  and  loops  implied  by  the 
link-pairs.  The  POP  is  assumed  to  be  one  of 
these  (explicit)  paths,  and  a  simple  test  is  pro¬ 
posed  as  a  way  of  selecting  it. 

The  function  of  the  SLPG  is  purely  syntac¬ 
tic/algorithmic  -  to  expand  the  path  informa¬ 
tion  implicit  in  the  augmented  link-pair-list. 
The  link-pairs  are  a  compact  encoding,  actu¬ 
ally  generators,  of  the  network  or  collection  of 
paths  to  be  produced  by  the  SLPG. 

We can easily partition a collection of link-pairs
into disjoint subsets so that every pair
of link-pairs referring to a common seg are in
the same subset, called a “link-pairs-association-set.”
The collection of segs corresponding to such
a subset is called a “seg-association-set.” Each
seg-association-set  implies  a  disjoint  network  of 
paths  (e.g.,  roads);  networks  consisting  of  a  few 
short  isolated  paths  can  often  be  discarded  as 
noise.  The  larger  networks  are  typically  re¬ 
turned  as  one  of  the  major  end-products  of  the 
system  described  here  when  used  to  find  all  the 
salient  paths  (e.g.,  roads  or  rivers)  in  an  image. 
In  this  paper  we  are  primarily  concerned  with  a 
second  type  of  output:  explicitly  extracting  the 
single  most  salient  path  (the  POP). 
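
The partition into link-pairs-association-sets amounts to finding connected components over the segs, which a union-find structure handles directly. The sketch below is one possible realization, not necessarily the one in our system; the function names are hypothetical, and a link-pair is assumed to be a pair of (seg-index, reversed) tuples.

```python
def partition_link_pairs(link_pairs):
    """Group link-pairs into link-pairs-association-sets: two link-pairs end
    up in the same set whenever they are connected through shared segs."""
    parent = {}

    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    # Union the two segs named by every link-pair.
    for (seg_a, _), (seg_b, _) in link_pairs:
        ra, rb = find(seg_a), find(seg_b)
        if ra != rb:
            parent[ra] = rb

    # Collect link-pairs by the root of their first seg.
    groups = {}
    for lp in link_pairs:
        groups.setdefault(find(lp[0][0]), []).append(lp)
    return list(groups.values())


def seg_association_set(link_pair_set):
    """Seg indices referenced by one link-pairs-association-set."""
    return {seg for lp in link_pair_set for seg, _ in lp}
```

Each returned group can then be processed independently, and groups whose seg-association-set is small can be dropped as noise.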

The  SLPG  operates  as  follows:  The  link-pairs 
are  first  partitioned  into  disjoint  subsets  (link- 
pairs-association-sets);  these  subsets  are  then 
processed  independently  to  extract  their  im¬ 
plied  paths  using  a  collection  of  algorithms. 
For  the  purposes  of  this  paper  we  describe  the 
LP-Basic-Path-Extension-Algorithm  (BEA)  in 
some  detail,  but  only  indicate  the  basis  for  the 
remainder  of  the  full  extraction  process  (see  Ap¬ 
pendix). 

A  maximal-path  through  an  LP-network  is  one 
that  cannot  be  a  proper  continuous  subsequence 
of  some  longer  path;  we  will  call  the  endpoints 
of  a  maximal-path  terminal-nodes.  A  loop  is  a 
maximal-path  that  begins  and  ends  on  the  same 
terminal-node.  One  requirement  of  the  SLPG  is 
to  explicitly  list  all  maximal-paths. 

If  the  BEA  is  given  a  terminal-node  as  a  seed,  it 
will  iteratively  generate  all  the  maximal-paths 
that  have  the  given  terminal-node  as  (at  least) 
one  of  their  endpoints.  It  is  both  fast  and  sim¬ 
ple  to  find  all  the  free  endpoints  (nodes  of  de¬ 
gree  one)  of  an  LP-network  given  its  associated 
list  of  link-pairs;  each  such  free  endpoint  (called 
an  ept)  is  a  terminal-node  of  one  or  more  of 
the  maximal-paths.  In  an  LP-network  without 
loops,  we  can  generate  all  the  maximal-paths 
using  the  set  of  epts  as  seeds.  Each  maximal- 
path  will  be  found  twice,  but  this  redundancy 
does  not  cause  any  problems.  The  redundant 
paths can be avoided at considerable additional
complexity in the BEA, but it is simpler
to just detect and delete them should this
be necessary.
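
The loop-free case can be sketched in a few lines. This is a simplified illustration of the BEA as described in the Appendix, under several assumptions: link-atoms are (seg-index, reversed) tuples with `True`/`False` for T/NIL, all function names are hypothetical, and the loop-collapsing step discussed next is not implemented (paths that close on themselves are simply reported).

```python
def successors(atom, link_pairs):
    """Link-atoms that may follow `atom`, reading each link-pair both as
    stored and in its equivalent reversed form."""
    seg, rev = atom
    out = []
    for (a, ra), (b, rb) in link_pairs:
        if (a, ra) == (seg, rev):
            out.append((b, rb))
        if (b, rb) == (seg, not rev):
            out.append((a, not ra))
    return out


def ept_seeds(link_pairs):
    """Seed link-atoms for every free endpoint (ept): a seg end that no
    link-pair refers to. The seed orients the seg to start at that end."""
    used, segs = set(), set()
    for (a, ra), (b, rb) in link_pairs:
        segs.update((a, b))
        used.add((a, "first" if ra else "last"))  # trailing end of a
        used.add((b, "last" if rb else "first"))  # leading end of b
    seeds = []
    for s in sorted(segs):
        if (s, "first") not in used:
            seeds.append((s, False))  # start at first point, as stored
        if (s, "last") not in used:
            seeds.append((s, True))   # start at last point, reversed
    return seeds


def basic_path_extension(seed, link_pairs):
    """Return (open_paths, closed_paths) grown from `seed`."""
    open_paths, closed_paths = [], []
    partial = [[seed]]
    while partial:
        next_partial = []
        for path in partial:
            ext = successors(path[-1], link_pairs)
            if not ext:
                open_paths.append(path)          # case (a): no extension
            seen = {seg for seg, _ in path}
            for atom in ext:
                if atom[0] in seen:
                    closed_paths.append(path + [atom])  # case (b): closes
                else:
                    next_partial.append(path + [atom])  # case (c): extend
        partial = next_partial
    return open_paths, closed_paths


def canonical(path):
    """Direction-independent key: a path and its reversal map to one key."""
    fwd = tuple(path)
    bwd = tuple((seg, not rev) for seg, rev in reversed(path))
    return min(fwd, bwd)
```

For three segs joined as a Y (two arms each linked to the stem but not to each other), seeding from the three epts yields four open paths, two distinct ones after `canonical` removes the reversed duplicates, and no path that joins the two arms.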

If  the  network  contains  loops,  except  for  some 
unusual  situations,  the  above  procedure  will  still 
return  all  the  maximal  paths  (including  the 
loops). Each loop could be generated many
times (an upper bound is the product of the
number of epts and the number of entry-points
to the given loop). If we wish to be assured that all the
maximal-paths  are  found,  and  also  reduce  the 
redundant  discovery  of  the  same  loop,  then  we 
can  proceed  as  above  (if  there  are  any  epts,  oth¬ 
erwise,  pick  any  node  as  the  first  seed  and  later 
discard  the  initial  set  of  non-maximal-paths). 
All  terminal-nodes  which  are  not  already  mem¬ 
bers  of  the  list  of  seeds  are  added  to  that  list 
whenever  they  are  found.  When  a  loop  is  re¬ 
turned  by  the  BEA,  we  have  to  modify  all  the 
link-pairs  that  point  to  segs  that  are  compo¬ 
nents  of  the  loop.  We  inactivate  all  link-pairs 
that  point  to  two  loop-segs,  and  replace  each 
link-pair  that  points  to  exactly  one  loop-seg 
with  two  new  link-pairs;  in  one  case  the  orig¬ 
inal  loop-seg-link-atom  is  replaced  by  a  link- 
atom  that  will  insert  into  a  non-terminating 
path  that  originally  included  one  or  more  loop- 
segs  a  dummy-seg  identifying  the  loop;  in  the 
second  case  the  original  loop-seg-link-atom  is  re¬ 
placed  by  a  link-atom  that  identifies  itself  as  a 
terminal-node  associated  with  the  given  loop. 
In  a  sense,  we  collapse  the  loops  in  the  origi¬ 
nal  LP-network  and  create  a  modified  loop-free 
network  in  which  the  BEA  (algorithm)  is  as¬ 
sured  to  return  all  the  maximal-paths.  Those 
returned  maximal-paths  that  contain  dummy- 
segs  are  easily  rectified. 

In summary, there are some interesting theoretical
issues that must be addressed in order to understand
how to make the SLPG more efficient,
but the algorithm we have described is computationally
acceptable, and it returns all the
maximal-paths as required to allow the SLPG
to correctly perform its function.

3.2.3  The  POP-Generator 

The list of maximal-paths (both open-paths and
loops) returned by the SLPG is assumed to contain
the POP. The segments comprising these
paths have been previously filtered to assure
compatibility with the semantic constraints of
the specified problem domain, and are perceptually
salient with respect to (at least some of)
the Gestalt laws of perceptual organization. In
the present algorithm, the POP-generator does
not alter any of the maximal paths but simply
selects the one that maximizes a combination
of both path-density and path-length: specifically,
the product of path-length and a power of
path-density, where path-density is measured by
(number-of-path-points) / (path-length).
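
The ranking can be illustrated as follows. This is a sketch, not the POP-generator itself: a path is assumed to be an ordered list of its detected (x, y) points, path-length is taken as the arc length through those points, and the exponent applied to path-density is exposed as a parameter (`density_exponent`) whose default value is a placeholder rather than a value taken from this paper.

```python
import math


def path_length(points):
    """Arc length of a path given as an ordered list of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def pop_score(points, density_exponent=2.0):
    """path-length * (path-density ** density_exponent), where
    path-density = number-of-path-points / path-length.
    The default exponent is a placeholder, not the paper's value."""
    length = path_length(points)
    if length == 0:
        return 0.0
    density = len(points) / length
    return length * density ** density_exponent


def select_pop(paths, density_exponent=2.0):
    """Return the highest-scoring path among the candidate maximal-paths."""
    return max(paths, key=lambda p: pop_score(p, density_exponent))
```

Because bridged gaps add length without adding points, a long path assembled mostly from links scores below a path of similar length that is densely supported by detected points.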

4  Experimental  Evaluation 

In  addition  to  a  significant  amount  of  previ¬ 
ous  informal  testing  and  evaluation  (some  parts 
of  the  Linear  Delineation  System  were  applied 
to  well  over  1,000  images  of  different  types 
and  with  different  delineation  goals),  we  are 
now  engaged  in  developing  a  formal  evaluation 
methodology,  especially  in  regard  to  road  delin¬ 
eation. 

In a typical road delineation problem (Figure
2) the Delineation System was invoked without
any manual intervention or parameter tuning.
We started with a 768 x 638 pixel image
(489,984 points) that resulted in a binarized version
(step 1) with 55,480 potential road points
(Fig 2b). As a result of the generic delineation
process (step 2), we extracted 340 segments
(RPATHS) containing 21,255 points (Fig
2c). We defined six semantic categories of interest
(Narrow Road, Wide Road, Proto Road,
Ambiguous, Background, River) and manually
classified the pixels along the paths into these
six categories. If the labeling of a given RPATH
was mixed, we counted the contiguous segments
with the same label as being distinct; thus we
judged that there really were 375 semantically
distinct segments containing 21,170 points comprising
the 340 actual RPATHS with an associated
count of 21,255 pixels. (Because of double
counting of segment and path intersection-points,
there is a small discrepancy in the number
of points in the actual RPATHS and in the
semantically labeled segments.)

Table 1 and Fig 2d show the effectiveness
of the Semantic Road Filter in retaining
road points/segments while eliminating the unwanted
background and river points/segments.
Since the Road Filter was designed to retain
narrow road segments, other structures (wide-roads,
proto-roads, ambiguous) that could possibly
be roads were considered to have a “don’t-care”
status in our evaluation.
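
The effect of the filter on each category can be read off Table 1 as a retention rate (points after filtering divided by points before). The short script below simply carries out that arithmetic on the point counts of Table 1; the names `TABLE_1_POINTS` and `retention` are introduced here for illustration.

```python
# (points before filtering, points after filtering), copied from Table 1.
TABLE_1_POINTS = {
    "Narrow Road": (5843, 5322),
    "Wide Road": (2102, 2082),
    "Proto Road": (2941, 1671),
    "Ambiguous": (1579, 425),
    "Background": (8192, 1720),
    "River": (513, 294),
}


def retention(before, after):
    """Fraction of a category's points that survive the filter."""
    return after / before if before else 0.0


for category, (before, after) in TABLE_1_POINTS.items():
    print(f"{category:12s} {100 * retention(before, after):5.1f}% retained")
```

On these counts the filter keeps roughly 91% of the narrow-road points while keeping about 21% of the background points.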

A “window” (Figure 3) was manually selected
and extracted from Figure 2 to test the POP-delineation
algorithm. Here we started with
a 475 x 149 pixel image (70,775 points) that resulted
in a binarized version (step 1) with 7,214
potential road points (Fig 3c). The extracted
set of 23 RPATHS contained 3,696 points (Fig
3d). Table 2 and Fig 3e show the result of applying
the Semantic Road Filter; it returned 37
segments containing 3,117 points in RPATHS-F
and 22 link-pairs (in *aux-link-pair-list*). The
SL-Segment-Linker produced 19 additional link-pairs,
and thus a total of 41 distinct link-pairs
were supplied to the SL-Path-Generator; these
were classified as consisting of 6 ept-pairs, 34
interior pairs, and 1 closed pair. The SLPG returned
43 open paths and 115 paths containing
loops, for a total of 158 maximal-paths. This set
of paths contained redundant entries; actually,
there were 8 distinct open-paths and 4 distinct
closed-paths (loops). Each path was assigned
a ranking using the metric presented in the
preceding section (the product of path-length
and a power of path-density). The POP-generator
then selected the highest ranking path (it happened to
be one of the closed-paths) as the POP (Fig 3f).
This was the desired delineation.

               RPATHS                        RPATHS-F2
Category       # points  % points   # paths  # points  % points   # paths
Narrow Road        5843        28        45      5322        46        40
Wide Road          2102        10        12      2082        18        12
Proto Road         2941        14        62      1671        15        46
Ambiguous          1579         7        37       425         4        19
Background         8192        39       210      1720        15        87
River               513         2         9       294         3         5
Total             21170       100       375     11514       100       209

Table 1: Categorized Delineations for FT-HOOD1 image. Total pixels = 768 x 638 = 489984


               W1-RPATHS                     W1-RPATHS-F2
Category       # points  % points   # paths  # points  % points   # paths
Narrow Road        3060        85         8      2935        94         7
Wide Road             0         0         0         0         0         0
Proto Road           63         2         1        46         1         1
Ambiguous           146         4         3        46         1         3
Background          319         9        10        88         3         6
River                 0         0         0         0         0         0
Total              3588       100        22      3115       100        17

Table 2: Categorized Delineations for FT-HOOD1-W1 image. Total pixels = 475 x 149 = 70775


5 Discussion

The work described in this paper is part of
an on-going effort to fully automate the process
of delineating perceptually and/or semantically
meaningful line-like structures appearing
in both aerial and ground-level images of scenes
consisting mostly of natural features (e.g., trees,
vegetation, drainage, and terrain) as well as
some man-made objects (especially roads). Our
intent in preparing this paper was, in addition
to its nominal subject matter, to describe relevant
components of the system being assembled
for this purpose, illustrate and quantify some of
its current performance, and discuss some aspects
of the conceptual basis for its design.

The  problem  of  finding  the  POP  in  (some  des¬ 
ignated  portion  of)  an  image  is  a  basic  require¬ 
ment  for  effective  man-machine  communication 
about  images,  as  well  as  a  challenging  problem 
whose  solution  is  required  to  accomplish  some 
of  the  more  general  delineation  tasks.  In  this 
paper,  we  provide  an  approach  to  the  solution 
of this problem, and an algorithm that is applicable
to a limited class of scene domains. The
algorithm has performed well on a small set of
test cases, but a significant amount of additional
testing will be required before we can be sure of
its utility and robustness.

An  important  contribution  of  this  paper  is 
the  introduction  of  the  LP-representation  (link- 
pair/LP-network)  and  associated  machinery  as 
a  generalization  of  the  conventional  graph.  The 
LP-network provides a very powerful way of
dealing with linear structures; it provides almost
complete generality in specifying connectivity
(more than is possible with a graph), it
provides  a  very  compact  description  of  the  (im¬ 
plied)  connected  structures,  and  it  admits  rea¬ 
sonable  algorithms  for  the  common  (relatively 
simple)  situations  to  be  expected  in  images  of 
real  scenes.  On  the  other  hand,  because  there 
are  specializations  of  the  linking  problem  that 
are  NP-hard,  there  are  no  generally  efficient  al¬ 
gorithms  for  this  purpose. 

There  are  many  open  problems  and  obvious  ex¬ 
tensions  of  the  work  discussed  in  this  paper. 
However,  one  of  the  more  interesting  extensions 
would  be  to  find  a  way  to  duplicate  human  per¬ 
formance  in  the  following  type  of  situation: 

Consider an image composed of a sequence
of 50 equal signs typed in a row (i.e.,
==================================================).
Also assume there is a solid horizontal line
positioned just below the equal signs that has
the same horizontal extent. If we assume that
there are four links possible between each pair
of successive equal signs (two straight links and
two cross-over links; we ignore the vertical links
which lead to short closed paths), then the number
of paths that potentially would have to be
generated grows exponentially with the number
of equal signs before we could decide, by
applying some objective function to explicit
descriptions of the competing paths, if some
path through the equal signs, or the solid line,
or neither, was the POP. Obviously,
the human doesn’t do this; he picks the solid
line almost immediately. How is he able to
avoid the combinatorial explosion?

One  of  our  main  points  in  this  paper,  and  the 
basis  of  our  approach,  was  that  much  of  the  as¬ 
sembly  of  the  ultimately  to  be  selected  POP 
had  to  take  place  in  the  generic  perception 
phase  which  sacrifices  flexibility  and  generality 
for  simplicity  and  speed.  The  Semantic  Linker 
is  computationally  limited  and  can’t  be  handed 
a  problem  with  too  many  choices.  Thus,  the 
overall  delineation  system  must  include  mech¬ 
anisms  that  enforce  complexity  constraints  on 
the output of each of its subsystems - this
type  of  control  could  be  accomplished  by  itera¬ 
tively  adjusting  algorithm  parameters.  It  thus 
appears  that  a  vision  system  must  be  fully  cog¬ 
nizant  of  its  computational  limitations  if  it  is  to 
operate  effectively.  Understanding  how  to  ac¬ 
complish  this  type  of  control  is  one  of  our  more 
immediate  goals. 

A  Appendix 
A.1 Definitions

EPT: a “free endpoint” designates the end of
a  seg  which  is  not  referenced  by  any  currently 
active  link-pair;  e.g.,  a  path  containing  an  ept 
cannot  be  further  extended  at  that  end. 

LINK-ATOM:  a  list  of  two  items,  the  first  is  an 
index  number  into  RPATHS-F;  i.e.,  it  points  to 
a  seg  in  RPATHS-F.  The  second  item  is  a  logical 
variable  (T  or  NIL)  which  specifies  whether  the 
seg  is  to  be  used  as  stored  (NIL)  or  reversed 
(T). 

LINK-LIST:  a  list  of  two  or  more  link-atoms.  It 
specifies  how  to  assemble  a  path  from  a  subset 
of  the  segs  stored  in  RPATHS-F. 


LINK-PAIR:  a  list  of  two  link-atoms.  It  spec¬ 
ifies  a  path  consisting  of  the  concatenation  of 
the  two  segs  in  the  order  listed,  with  the  points 
in  each  seg  taken  as  stored  in  RPATHS-F,  or 
reversed,  as  specified  by  the  logical  variables. 

LINK-PAIR-LIST: nominally, the list of link-pairs
produced by the SL-Segment-Linker and
the  Semantic  Filter. 

LOOP: a subsequence of a path that begins and
ends with an identical link-atom, or a subsequence
of a path that begins and ends with
link-atoms pointing to the same seg, but having
reversed directions.

LP-NETWORK:  the  collection  of  paths  implied 
by  a  collection  of  link  pairs. 

CONNECTED-LP-NETWORK (CLPN): a collection
of link-pairs can be partitioned into disjoint
subsets so that every pair of link-pairs referring
to a common seg are in the same subset,
called a “LINK-PAIRS-ASSOCIATION-SET.”
The collection of segs corresponding to such a
subset is called a “SEG-ASSOCIATION-SET.”
Each seg-association-set implies a disjoint network
of paths called a CLPN.

PATH: a concatenation of segs (segments) as specified
by a link-list. No seg can appear more than
once  -  with  the  exception  of  the  seg  specified 
by  the  head  link-atom  in  the  case  of  a  loop  or 
semi-loop.  The  HEAD  of  the  path  is  intended 
to  refer  to  the  end  at  which  the  path  is  being 
extended;  the  TAIL  of  the  path  is  intended  to 
refer  to  the  end  of  the  path  containing  the  seed 
link-atom,  and  we  arbitrarily  assume  that  the 
path  is,  or  was,  constructed  by  a  sequential  ac¬ 
cumulation  of  segs  starting  at  the  tail-end.  For 
most  purposes,  we  further  restrict  this  defini¬ 
tion  to  prohibit  the  path  from  visiting  a  vertex 
more  than  once. 

OPEN-PATH:  a  path  that  does  not  contain  a 
(complete)  loop. 

MAXIMAL-PATH:  A  maximal-path  through 
an  LP-network  is  one  that  cannot  be  a  proper 
continuous  subsequence  of  some  longer  path; 
we  will  call  the  endpoints  of  a  maximal-path 
TERMINAL-NODES.  A  LOOP  is  a  maximal- 
path  that  begins  and  ends  on  the  same  terminal- 
node.
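
The link-atom, link-pair, and link-list definitions map directly onto small data structures. The sketch below assumes a representation that is ours for illustration only: a link-atom is an (index, reversed) tuple with a 0-based index into RPATHS-F and Python `True`/`False` in place of T/NIL, and both function names are hypothetical.

```python
def assemble_path(link_list, rpaths_f):
    """Concatenate the segs named by a link-list of (index, reversed) atoms.
    `reversed` is True for T and False for NIL."""
    points = []
    for index, is_reversed in link_list:
        seg = rpaths_f[index]
        points.extend(seg[::-1] if is_reversed else seg)
    return points


def reverse_link_pair(link_pair):
    """Equivalent link-pair describing the same junction traversed in the
    opposite direction: swap the atoms and flip both flags."""
    (a, ra), (b, rb) = link_pair
    return ((b, not rb), (a, not ra))
```

Assembling the reversed link-pair produces the same points in the opposite order, which is the sense in which two such link-pairs join the segs "in the same way."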


A.2 Some Attributes of an
LP-network

1.  If  an  LP-network  has  no  loops,  then  the 
terminal-nodes  of  every  maximal-path  are 
epts;  i.e.,  every  fully  extended  path  will  be¬ 
gin  and  end  at  an  endpoint  of  a  seg  which 
is  not  connected  (by  an  active  link-pair) 
to  any  other  seg.  If  the  LP-network  does 
contain loops, then the “entry-points” to
the  loops  will  also  be  terminal-nodes  of 
maximal-paths. 

2.  There  might  not  be  any  single  path  con¬ 
necting  two  given  epts,  even  in  a  connected- 
component  of  the  network.  For  example, 
consider  a  network  consisting  of  three  segs 
connected  as  a  Y.  If  the  two  upper  arms 
of  the  Y  both  connect  to  the  lower  vertical 
stem,  but  not  to  each  other,  then  there  is 
no  direct  path  linking  the  two  free  ends  of 
the  upper  segs  of  the  Y. 

3.  An  LP-network  can  always  be  converted 
into  a  conventional  graph  by  adding  addi¬ 
tional  link-pairs  so  as  to  make  every  vertex 
“fully-connected.” 

A.3 Basic Algorithms

Key  algorithms  include: 

• lp-basic-path-extension-algorithm

•  partition-lp-network-into-connected- 
subnetworks 

•  identify-ept-lp-algorithm 


A.3.1 LP-Basic-Path-Extension-Algorithm

For a given subset of link-pairs (say set q), a
link-atom (say s1) is selected from one of the
link-pairs in q as a seed, and a collection of
paths is iteratively constructed from this seed
by successively scanning all the link-pairs (in q)
for additional segs to append to the set of paths
still being extended. We note that a path is
represented by a link-list (a list of link-atoms);
all the partial paths we are currently extending
have one endpoint defined by the initial seed s1.
Consider a particular path (say p7) which has
as its current terminal link-atom the list (seg24
t). If some link-pair in q has the form ((seg24
t) (segX t/nil)), then p7 can be extended by
appending link-atom (segX t/nil) to its current
link-list. There may be more than one link-pair
in q capable of extending the current version
of p7. Further, if some link-pair in q has
the form ((segZ nil/t) (seg24 nil)), this link-pair,
representing the concatenation of segs segZ and
seg24, is equivalent (in that the two segs are
joined in the same way) to the link-pair represented
by ((seg24 t) (segZ t/nil)), and thus
p7 could be extended by appending link-atom
(segZ t/nil).

At each iteration we start with a list of partial
paths, and for each such path there are three
possibilities: (a) there are no further extensions,
in which case the path is placed on the output
open-path-list; (b) there is an extension,
but the extending seg (or the unattached vertex
of the seg) already appears in the partial
path, in which case the extended path is placed
on the output closed-path-list; (c) there are one
or more (single seg) extensions with new segs,
in which case all such extensions are placed on
a new partial-path-list and the above process
is repeated. The process terminates when the
partial-path-list has no entries at the start of a
new iteration.
[Figure labels: Semantic Filtering and Partitioning; The Detector/Binarizer System (DBS); panels (b) FT-HOOD1-W1, (c) W1-MASK]

Abstract

Image understanding (IU) techniques for automatic
site reconstruction have demonstrated success
within restricted domains and for small numbers
of model classes. However, these techniques
often  fail  when  applied  out  of  context  and  do  not 
“scale-up”  into  a  more  general  solution.  Under  the 
APGD  program,  we  are  constructing  a  knowledge- 
based  site  reconstruction  system  that  automatically 
selects  the  correct  algorithm  according  to  the  cur¬ 
rent  context,  applies  it  to  a  focused  subset  of  the 
data,  and  constrains  the  interpretation  of  the  result 
through  the  explicit  use  of  knowledge. 

1  Introduction 

The  extraction  and  reconstruction  of  building  mod¬ 
els  from  aerial  images  has  become  an  important  area 
of  research  in  recent  years.  Significant  progress  has 
been  made  and  several  systems  perform  reasonably 
well  within  their  appropriate  domains  [Collins’95, 
Herman’94,  Lin  et  al.’94,  Chellapa  et  al.’94].  For 
example, recent testing of the Ascender I system
has shown it capable of automatically extracting
a large percentage of the buildings within a sub-region
of the Fort Hood dataset [Collins et al.’96].
Although these results are significant, the system
was designed to perform well under particular contexts
and is only capable of detecting the single class
of buildings whose rooftops are flat rectilinear polygons.

The modest successes attained by Ascender I and
similar systems can, we believe, be traced to their
narrow scope and application to highly constrained
data. The class of flat roofed rectilinear buildings is
very clearly defined by a set of geometric and spatial
properties which are useful for recognition if the incidence
of distracting classes is small (that is, when
the data is suitably constrained). This idea of a (set
of) local expert(s) for recognizing instances of an
object class played a prominent role in early work on
the Schema system [Draper’89], as well as other systems
in the aerial image domain [Chellapa et al.’94,
Huertas and Nevatia’80, CiflTord and McKeown’94,
Jaynes’96a, Matsuyama’85]. Under this model, robustness
is achieved by providing multiple reconstruction/recognition
strategies which are applicable
under well defined conditions, and generality is
achieved by increasing the number of object classes
to describe a larger fraction of the world.

Funded by the RADIUS project under DARPA/Army
TEC contract number DACA76-92-C-0041, by NSF grant
number CDA-8922572, and in part by the National Council
for Scientific Research-Brazil grant number 260185/92.2.

Work has begun on Ascender II, a geometric site modelling system based
on this general framework. The design of Ascender II is founded on three
basic principles: 1) Specific image understanding strategies are clearly
successful under particular contexts for a particular class of objects
but may break down when applied in contexts that exceed the design
constraints. 2) Domain knowledge, knowledge acquired from partial
processing of the data, and knowledge about available image
understanding strategies are all valuable in constraining the
reconstruction problem. 3) A successful system will contain many
specific strategies but will selectively apply them in the correct
context, with the correct set of parameters, and will fuse the results
of individual strategies into a complete reconstruction.

2  Ascender  II 

The Ascender II system explicitly represents both knowledge and context
to support a purposeful reconstruction of the site using geometric and
spatial reasoning and intelligent control of sophisticated IU
algorithms. The system is divided into a visual subsystem and a
knowledge base. The visual subsystem resides within the Radius Common
Development Environment (RCDE) [Mundy et al.’92] and contains a library
of IU algorithms and a geometric database that holds the available data
(images, line segments, functional classifications, etc.) as well as
models that may have been acquired through processing. The knowledge
base is based on belief networks and is constructed using HUGIN
[Andersen’89], a system for designing belief networks and influence
diagrams. The knowledge base consists of reasoning mechanisms, a control
system, and the belief network that represents the current set of
knowledge about the site. The two systems communicate through Unix
socket IP mechanisms. Figure 1 shows an overview of the system.

Reasoning takes place over regions of discourse that represent a subset
of the available data. Regions of discourse may be image regions, a
particular building model, or other sets of data that may have been
produced by the system.

Processing of the data proceeds in a straightforward way. First, the
knowledge base is consulted and an appropriate IU algorithm and subset
of the current region are selected. For example, a search for line
evidence along the center of a building region may be invoked to gather
evidence for the presence of a peaked roof building. The choice of
algorithm and subset of the data is sent as a request to the visual
subsystem for processing. The IU algorithm is applied to the data and
the database is updated with the result (for example, a set of line
features may be produced). The visual subsystem then converts the result
into a single value that represents the belief that the requested
evidence was present. This belief is passed to the knowledge base, where
it is used to update the belief network. The next appropriate action is
then selected based on the control policy.
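The processing cycle just described can be sketched as a short loop. This is only an illustration under stated assumptions: every function and type name below (`run_control_loop`, `select_action`, `run_algorithm`, `to_belief`, `update`) is hypothetical and is not part of Ascender II, the RCDE, or HUGIN, and the toy `update` simply overwrites one belief value instead of propagating evidence through a belief network.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Every name here is a hypothetical placeholder, not an Ascender II or RCDE API.
Action = Tuple[str, str]          # (algorithm name, region of discourse)
Beliefs = Dict[str, float]


def run_control_loop(
    beliefs: Beliefs,
    select_action: Callable[[Beliefs], Optional[Action]],
    run_algorithm: Callable[[Action], List[float]],
    to_belief: Callable[[List[float]], float],
    update: Callable[[Beliefs, Action, float], Beliefs],
    max_steps: int = 10,
) -> Tuple[Beliefs, List[Action]]:
    """Consult the knowledge base, run one algorithm, fold the evidence back in."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = select_action(beliefs)      # knowledge base picks algorithm and data subset
        if action is None:                   # nothing applicable is left
            break
        result = run_algorithm(action)       # visual subsystem processes the request
        evidence = to_belief(result)         # result collapsed to a single belief value
        beliefs = update(beliefs, action, evidence)
        history.append(action)
    return beliefs, history


def demo() -> Tuple[Beliefs, List[Action]]:
    """Toy run: one request for center-line evidence inside a building region."""
    pending: List[Action] = [("center_line_search", "building_region")]

    def select_action(_: Beliefs) -> Optional[Action]:
        return pending.pop() if pending else None

    def run_algorithm(_: Action) -> List[float]:
        return [0.9, 0.7]                    # made-up line-segment scores

    def to_belief(scores: List[float]) -> float:
        return sum(scores) / len(scores)

    def update(old: Beliefs, _: Action, evidence: float) -> Beliefs:
        new = dict(old)
        new["peaked_roof"] = evidence        # stand-in for belief-network propagation
        return new

    return run_control_loop({"peaked_roof": 0.5}, select_action,
                            run_algorithm, to_belief, update)
```

Passing the four steps in as callables keeps the loop independent of any particular algorithm library, which mirrors the separation between knowledge base and visual subsystem described in the text.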

2.1  Knowledge  Base 

The knowledge base is capable of representing the current context,
specific site knowledge (either engineered or acquired as part of
processing), and general domain knowledge relevant to site modelling.
Knowledge is stored in a Schema Network. The representation is a
combination of two important ideas drawn from the field of Artificial
Intelligence: Schemas [Draper’89] and Belief Networks [Jensen’96]. The
network encodes information about how and when algorithms can be applied
in the current context and explicitly represents the causal dependencies
found in a particular domain. Nodes within the network represent
discrete variables that are associated with the domain. For example, the
node Building-Roof-Shape may have the discrete states {flat, peaked,
curved, composite}. At each node, an evidence policy contains
information about how evidence for that node may be acquired (for
example, how evidence for a peaked roof building may be gathered).
Contextual rules, part of the node’s evidence policy, assist in the
selection of the correct algorithm, data, and parameters for the given
context.

Figure 1: Ascender II system overview. Control decisions are based on
the current knowledge about the site. Vision algorithms, stored in the
RCDE, gather evidence about the site and produce a site model. Ascender
I provides one set of IU strategies relevant to site reconstruction.

Edges within the network represent a conditional dependence between a
parent and child node. Associated with each node is a conditional
probability table that contains a probability for each state of the node
given the values of the parents. The belief for a parent node, then, can
be computed from the values of the children using causal influence
[Russel’95]. As evidence is added to the network (through the execution
of an evidence policy), the effect is propagated throughout the network
and a new set of belief values is computed.
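To make the update concrete, the following minimal sketch applies Bayes' rule across a single parent-child edge. It is not HUGIN's propagation scheme, which handles whole networks; the function name `posterior` and all probabilities are invented for illustration.

```python
from typing import Dict


def posterior(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule: P(state | evidence) is proportional to P(evidence | state) * P(state)."""
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("evidence has zero probability under every state")
    return {state: value / total for state, value in unnormalized.items()}


# Hypothetical numbers, not taken from the paper.
prior = {"flat": 0.5, "peaked": 0.5}
# One column of the child's conditional probability table:
# P(center line observed | roof shape).
p_line_given_shape = {"flat": 0.2, "peaked": 0.8}

belief = posterior(prior, p_line_given_shape)
```

With these made-up numbers, observing a center line moves the belief in a peaked roof from 0.5 to 0.8.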

2.1.1  Action  Selection 

The problem of action selection within the schematic network is a
significant one. Currently we take a greedy approach. In order to gather
evidence about a node n, the corresponding evidence policy is invoked.
If the evidence policy at n is empty (there are no IU algorithms
directly applicable to computing belief(n)) or there are no available
algorithms for the current context, then the children of n are visited.
The child whose belief value contains the highest uncertainty is
selected, and either its evidence policy is invoked or its children are
visited in turn. Once evidence has been computed, the belief values are
propagated back through the network and a new action is selected. This
implies that there must be at least one evidence policy available at
each of the leaf nodes within the network.

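Certainty for node n is defined as the difference between the maximum
belief over the states of n and the belief value each state would have
if all states at n were equally likely:

$$\mathrm{Certainty}(n) = \max_{s \in S_n} \mathrm{Belief}(n = s) - \frac{1}{|S_n|}$$

where S_n is the set of states of n.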
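The greedy descent and the certainty measure defined above can be sketched as follows. The `Node` class, its fields, and `select_node` are assumptions made for this example and do not reflect the actual Ascender II data structures; an evidence policy is reduced to an optional string.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical network node: a belief per state and an optional evidence policy."""
    name: str
    belief: Dict[str, float]
    policy: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def certainty(belief: Dict[str, float]) -> float:
    """Maximum belief minus the value every state would have if all were equally likely."""
    return max(belief.values()) - 1.0 / len(belief)


def select_node(node: Node) -> Optional[Node]:
    """Greedy descent: use a node's own policy if present, else visit its least certain child."""
    current: Optional[Node] = node
    while current is not None:
        if current.policy is not None:
            return current
        if not current.children:
            return None                      # leaf without a policy: nothing to run
        current = min(current.children, key=lambda child: certainty(child.belief))
    return None


sure_child = Node("L-junctions", {"present": 0.9, "absent": 0.1}, policy="find_L")
unsure_child = Node("T-junctions", {"present": 0.55, "absent": 0.45}, policy="find_T")
root = Node("footprint class", {"a": 0.4, "b": 0.3, "c": 0.3},
            children=[sure_child, unsure_child])
chosen = select_node(root)
```

Here the root has no policy of its own, so the child with the lowest certainty (0.05 against 0.4) is selected.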

2.2  Visual  Subsystem 

The visual subsystem comprises two parts: a function library that stores
the set of IU algorithms available to the system, and a geometric
database that contains available data in the form of imagery, partial
models, and other collateral information about the site (such as
classification of functional areas).

The library of Ascender II algorithms must address aspects of the site
reconstruction problem. For example, finding regions that may contain
buildings, classifying building rooftop shapes, and determining the
position of other cultural features are all important tasks for the
Ascender II system. Many of the IU algorithms may be very “lightweight”
and are expected to perform only in a constrained top-down manner. This
is because the IU algorithms are responsible for gathering evidence for
a particular hypothesis put forward by the knowledge base. For example,
an algorithm that detects the presence of a local maximum in a region of
the elevation data can be viewed as a car detector when invoked on a
parking lot area. The same algorithm may detect the presence of a
rooftop structure when applied to a known building area.

Algorithms may also be very sophisticated, such as the reconstruction of
flat roof buildings from multiple views (the role of the Ascender I
system). Below, several of the more complex algorithms are briefly
described in order to demonstrate the type of algorithms made available
to the system.

2D Polygon Detection [Jaynes’94]: Search the optical image for polygons
that represent high confidence rooftop boundaries. Lines and corners are
extracted from the image and grouped into perceptually compatible
chains. A search of all possible groupings returns the maximal
independent set of closed chains.

2.5D Feature Grouping: Match image features across multiple images to
compute heights and group based on height/shape constraints. For
example, compute line heights through a multi-image matching scheme,
then group the line segments into sets of two parallel lines at the same
height with a third, higher parallel line. Such groups form regions that
may indicate the presence of a peaked roof building.

Local Shadow Analysis [Lin et al.’94]: The known sun angle and building
model constrain the search for a corresponding shadow in the image. The
shape of the shadow can be analyzed to infer the shape of the building
rooftop that cast it.

Automatic Model Indexing [Jaynes’97]: Match a region of an elevation
image with a surface primitive database. This is accomplished by
constructing the extended Gaussian image for the image region and
correlating the surface orientation histogram with the database.

Fitting Parametric Surfaces to DEMs [Jaynes’96b]: Fit a model to a
region within the elevation data. The model parameters should already
have been determined through another process (model indexing, for
example).
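As a hedged illustration of the last item, the sketch below fits the simplest parametric surface, a plane z = a*x + b*y + c, to elevation samples by least squares and reports the RMS residual as a measure of fit quality. The surface models actually used by the cited method are not described here, so the plane model, the function names `fit_plane` and `_det3`, and the sample data are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(points: Sequence[Tuple[float, float, float]]) -> Tuple[float, float, float, float]:
    """Least-squares fit of z = a*x + b*y + c; returns (a, b, c, rms_residual)."""
    n = float(len(points))
    sx = sum(x for x, _, _ in points)
    sy = sum(y for _, y, _ in points)
    sz = sum(z for _, _, z in points)
    sxx = sum(x * x for x, _, _ in points)
    syy = sum(y * y for _, y, _ in points)
    sxy = sum(x * y for x, y, _ in points)
    sxz = sum(x * z for x, _, z in points)
    syz = sum(y * z for _, y, z in points)

    # Normal equations of the least-squares problem.
    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    det = _det3(normal)
    if abs(det) < 1e-12:
        raise ValueError("points are degenerate (collinear or too few)")

    coeffs: List[float] = []
    for col in range(3):                     # Cramer's rule, one unknown per column
        replaced = [row[:] for row in normal]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(_det3(replaced) / det)
    a, b, c = coeffs

    rms = math.sqrt(sum((a * x + b * y + c - z) ** 2 for x, y, z in points) / n)
    return a, b, c, rms


# Synthetic elevation samples lying exactly on z = 2x - y + 3.
samples = [(float(x), float(y), 2.0 * x - y + 3.0) for x in range(3) for y in range(3)]
a, b, c, rms = fit_plane(samples)
```

A small residual would count as evidence for a single planar surface, while a large one suggests that a different surface model should be tried.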


As research into Ascender II continues, more IU algorithms will be added
to the system. However, in order for the Ascender II framework to be
useful, the cost of adding a new algorithm to the system must not be
prohibitive, something that proved to be a problem in earlier
knowledge-based vision systems [Draper’89]. Only two components are
necessary to convert an IU algorithm into an evidence policy that is
usable by the system. First, the context in which the algorithm is
intended to be run must be defined. Currently, the definition of
allowable contexts is straightforward and only disallows algorithms from
being run in invalid contexts (on the wrong type of data, for example).
This is similar to the Context Sets introduced in the Condor system
[Strat’93] and rule packets within the HUB. This definition of context
is expected to be too simple for our needs, and eventually the framework
will be extended to allow the definition of a performance profile for
each algorithm that defines the expected performance of the algorithm
under different contexts. Second, a method for deriving a certainty
value from the output of the algorithm must be defined. This certainty
value is used by the system to update the knowledge base using Bayesian
inference. For example, the detection of L-junctions within a region of
the image must be converted to a single value that represents the
probability that the L-junctions are present.

2.3  Preliminary  Tests 

An experiment was conducted on a scene from the Fort Hood dataset. The
test was both a simple example of the concepts presented here and a
demonstration of the communication mechanisms that have been constructed
as part of the Ascender II system. A small schematic network (only four
nodes) was engineered that attempts to classify rectangular building
boundaries (called building footprints) according to three categories,
Single, Multi-level, and Multiple, which correspond to the case of a
single planar rooftop, several planes or slopes at different heights, or
more than one building in the region.

The network used for the test is shown in Figure 2. The network encodes
the fact that the classification of the footprint is dependent upon the
presence of certain junction types along the edges of the region, and on
the quality of a single planar surface fit to the corresponding
elevation data. An evidence policy that defined the plane fit algorithm
and its reliance on available elevation data was constructed for the
Plan Level node. Similarly, evidence policies for both L and T junctions
were constructed.

Each child node in the network has an associated conditional probability
table that encodes object specific knowledge. The conditional
probability values are engineered for the specific problem and, for the
test here, were constructed based on our experience with both the
evidence policies and the domain.

Figure 2: The network used to control the classification of a building
region into one of the three possibilities: single building, double
building or multilevel building.


The 2D polygon detector used in Ascender I was run on a single
downlooking view of a subregion of the Fort Hood dataset. Ten of the
polygons were selected for classification using the Ascender II system.
In all, four polygons contained single level rooftops, four contained
multi-level buildings, and two contained more than one building. Figure
3 shows a typical polygon for each of the classes.

Figure 3: Three different building footprint cases. Leftmost: Single
Rooftop. Center: Multi-level Building. Rightmost: Two distinct
buildings.


The system was run on all ten regions and stopped when the belief value
for one of the states of footprint class exceeded 65% or the controller
was unable to select a new action. The region was then classified
according to the state of footprint class with the maximum belief value.
The table below shows the results of the experiment and the number of
vision algorithms executed in order to classify each region.


Polygon Type | # Actions | Classification/Belief
-------------|-----------|----------------------
Single       | 1         | Single (75%)
Multi-Level  | 6         | Multi-Level (57%)
Multiple     | 4         | Multiple (57%)
Single       | 4         | Single (55%)
Multi-Level  | 6         | Multi-Level (50%)
Multi-Level  | 2         | Single (75%)
Single       | 2         | Single (75%)
Multi-Level  | 4         | Multi-Level (83%)
Single       | 4         | Single (81%)
Multiple     | 6         | Multiple (65%)
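The stopping rule and the final classification step used in this experiment can be written down in a few lines. This is a sketch only: the function names `should_stop` and `classify` are hypothetical, and the belief values in the usage lines are invented rather than taken from a specific run.

```python
from typing import Dict, Tuple


def should_stop(beliefs: Dict[str, float], action_available: bool,
                threshold: float = 0.65) -> bool:
    """Stop when some state's belief exceeds the threshold or no action can be selected."""
    return max(beliefs.values()) > threshold or not action_available


def classify(beliefs: Dict[str, float]) -> Tuple[str, float]:
    """Return the state with the maximum belief together with that belief."""
    state = max(beliefs, key=lambda s: beliefs[s])
    return state, beliefs[state]


# Invented belief values for the three footprint classes.
confident = {"Single": 0.75, "Multi-level": 0.15, "Multiple": 0.10}
undecided = {"Single": 0.40, "Multi-level": 0.35, "Multiple": 0.25}
```

The second condition explains why some rows of the table report a final belief below 65%: the controller ran out of applicable actions before the threshold was reached.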

3  Future  Directions 

Ascender II is based on a much more flexible design than was Ascender I.
Our goal is to demonstrate that this flexibility improves system
performance and widens its scope of applicability. To this end, work is
underway on engineering the software architecture of Ascender II and on
the development of additional evidence policies for a wider range of
building classes. The general framework being employed supports any type
of data as long as there are corresponding evidence policies available
for interpreting it. Consequently, the system is being extended to
include IFSAR elevation maps (in addition to elevation maps from
traditional stereo techniques) and multi-spectral imagery for improved
ground classifications. We expect to use the Fort Hood image dataset as
well as other datasets as they become available (e.g. Ft. Benning) to
demonstrate the Ascender II system.

There are many issues to be addressed during the design and
implementation of Ascender II. One issue concerns the granularity of the
IU algorithms employed in the system and how this affects system
performance. For example, should Ascender I be dismantled into component
parts and reassembled in the knowledge network? Previous attempts to
build knowledge-based systems ran into major knowledge engineering
problems. The treatment of IU algorithms as black-box evidence gathering
mechanisms, regardless of the underlying complexity, may be one way to
avoid this. Currently, a simple greedy evidence policy is being used to
select the next action. What other policies are reasonable, and how do
they affect the system efficiency? Techniques that compare the expected
utility of applying a particular evidence policy to its expected cost
will be investigated as one way to answer the question of efficient
control.
Abstract

In this paper we focus on two recent improvements in 3D scene
interpretation using the UMass terrain and site model reconstruction
systems (Terrest and Ascender). The first utilizes a computationally
efficient implementation of adaptive windowing to improve rooftop
elevation estimates, which are used as input for robust building model
generation. The second utilizes intermediate results generated by
Terrest to generate a new set of 3D features, which are used as input to
a terrain classification system. The techniques were evaluated by
analyzing aerial images of Fort Hood, Texas. It was shown that these
techniques resulted in significant improvement in the quality of
reconstructed building models and the accuracy of terrain
classification.

1  Adaptive  windowing 

The surface reflectance properties of most man-made and natural objects
are approximately Lambertian (i.e., the brightness of a surface feature
is independent of the position of the observer). Texture based image
matching algorithms exploit this phenomenon by looking for similar
texture patterns in a pair of images. These algorithms have trouble near
occluded boundaries, on non-Lambertian features, when image noise
obscures surface texture, or when the sensors are widely spaced. In
previous work, we described several methods to compensate for
perspective distortion introduced by widely spaced cameras (Schultz,
1994, 1995). Although these techniques were shown to be successful in
terrain reconstruction, they produced unreliable results near sharp
discontinuities in elevation (building edges).

* This work was supported in part by the Defense Advanced Research
Projects Agency (via TEC) contract number DACA76-92-C-0041, the Office
of Naval Research grant number N00014-89-J-3229, and by the National
Science Foundation grant number CDA-8922572.

Consider the problem of computing elevations of a flat roof building.
When the center of the matching window is located on the roof and near a
corner, most of the windowed pixels are not on the roof. As a result,
the ground texture dominates, and the rooftop height is assigned to
ground level. This results in a melted building effect, which can be
seen on the cover of the 1994 IUW proceedings (Jaynes, et al., 1996).

To prevent errors caused by discontinuities in disparity, Kanade and
Okutomi (1994) proposed using a deformable matching window. Their method
assumes no a-priori knowledge about the surface structure. Instead, the
algorithm assumes that disparity discontinuities are associated with
gray-scale discontinuities (a condition commonly found at the edges of
buildings). Although this assumption is valid for many man-made objects,
it often fails to describe a wide range of real world conditions, such
as rooftop clutter, shadows, patchy vegetation, and road edges. The
adaptive windowing technique was adapted to the problem of analyzing
complex scenes by employing a building detection algorithm to identify
disparity discontinuities.

Instead of looking for jumps in gray-scale to identify disparity
discontinuities, we relied on detected rooftop polygons. First, the
algorithm detects rooftop polygons in the reference image using a 2D
building detection algorithm. The algorithm then generates a binary
building mask by filling in and dilating the polygons.

When  the  building  mask  method  is  used  during 
reconstruction,  only  pixels  that  lie  within  the  mask 
contribute  to  the  match  score.  Furthermore,  disparities 
are  computed  only  if  the  center  of  the  matching 
window  lies  within  the  building  mask. 
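The masking rule can be illustrated with a normalized cross-correlation that ignores every pixel outside the mask. This sketch is not the Terrest match score of (Schultz, 1994), whose exact form is not given here; it is a standard normalized cross-correlation restricted to masked pixels, and the function name `masked_ncc` and the window values are assumptions.

```python
import math
from typing import Optional, Sequence

Window = Sequence[Sequence[float]]


def masked_ncc(ref: Window, tgt: Window, mask: Sequence[Sequence[int]]) -> Optional[float]:
    """Normalized cross-correlation over the pixels where mask is nonzero.

    Returns None when the window center is outside the mask, when fewer than
    two masked pixels exist, or when either masked window has zero variance.
    """
    center_row, center_col = len(mask) // 2, len(mask[0]) // 2
    if not mask[center_row][center_col]:
        return None                          # disparity only computed for masked centers

    pairs = [(r, t)
             for ref_row, tgt_row, mask_row in zip(ref, tgt, mask)
             for r, t, m in zip(ref_row, tgt_row, mask_row) if m]
    n = len(pairs)
    if n < 2:
        return None

    mean_r = sum(r for r, _ in pairs) / n
    mean_t = sum(t for _, t in pairs) / n
    cov = sum((r - mean_r) * (t - mean_t) for r, t in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_t = sum((t - mean_t) ** 2 for _, t in pairs)
    if var_r == 0.0 or var_t == 0.0:
        return None
    return cov / math.sqrt(var_r * var_t)


# 3x3 toy windows: roof pixels occupy the upper-left 2x2 block.
roof_mask = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
reference = [[1.0, 2.0, 9.0], [3.0, 4.0, 9.0], [9.0, 9.0, 9.0]]
target = [[2.0, 4.0, 0.0], [6.0, 8.0, 5.0], [1.0, 7.0, 3.0]]
score = masked_ncc(reference, target, roof_mask)
```

In the toy windows the roof pixels of the target are exactly twice those of the reference, so the masked score is 1 even though the surrounding ground pixels are unrelated.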

For the reference image, the masking operation is straightforward: all
computations affected by the building mask can be carried out in
advance. The system pre-multiplies the reference image by the building
mask, and counts the number of rooftop pixels surrounding each image
pixel. Unfortunately, most of these operations cannot be pre-computed
for the target image, because the projection of the building mask onto
the target image depends on the elevations, which are not known in
advance.

To maintain computational efficiency, the unmasked mean and variance
terms in the cross-correlation match score (Schultz, 1994) are
substituted for the masked ones. The covariance term is not approximated
because all non-roof pixels in the reference image have a value of zero.
After the matching process is completed, elevations are generated from
the computed disparity map.

The algorithm was tested on a randomly selected cluster of peaked roof
buildings in the Fort Hood data set. Figure 1 shows the reference image,
rooftop mask, target image, and the recovered DEMs with and without
using the building mask. The fitted peaked roof models generated by
Ascender (Jaynes, et al., 1996) are shown in Figure 2.


Although the buildings have symmetric peaked roofs, we fitted an
asymmetric peaked roof model (the position of the center line was an
adjusted parameter). This allowed us to test the effect of the adaptive
windowing on the Ascender robust model fitting algorithm. Unfortunately,
ground truth is not available for this data set. Nevertheless, based on
the position of the roof ridge, it is apparent that the adaptive window
improved the quality of the reconstructed model by removing the effects
of surrounding objects.

The improvement in the reconstruction accuracy may be attributed to the
building mask removing the influence of objects that surround the
buildings. The trees at the right and bottom of the scene artificially
raise the rooftop elevations, and the parking lot on the left
artificially lowers the rooftop elevations.

2  Using  3D  features  to  improve  terrain 
classification 

Texture analysis is an important area in computer vision and has been
extensively studied (Tamura, et al., 1978; Haralick, 1979; Gool, et al.,
1985; and Tuceryan, 1993). While it is widely accepted that texture does
not have a precise definition, the importance of texture for terrain
classification has been well established. Tuceryan and Jain (1993)
identified four basic approaches to texture analysis: statistical,
geometrical, model-based, and signal processing methods. Although the
methodologies vary substantially from one algorithm to another, they
generally assume that texture is a characteristic of the 2D image and
only indirectly related to the 3D structure of the surface.


Figure 1. Reconstruction example with and without a building mask.
Panels: reference image, building mask, target image, DEM with mask, DEM
without mask.

Figure 2. The reconstructed building models, with and without the
building mask.


The  limitation  of  this  approach  is  that  the  3D 
properties  of  real  objects  are  not  utilized.  Recently,  in 
a  series  of  experiments  at  UMass  (Wang  et  al.,  1997), 
we  investigated  the  role  of  3D  texture  on  terrain 
classification.  This  work  examined  ways  of  inferring 
3D  surface  textures  (such  as  coarseness  and 
roughness)  from  images. 

Because information about the 3D micro-structure of a scene usually is
not available, 3D features must be derived from images or other remotely
sensed data. Instead of reconstructing a scene and then computing a 3D
feature set, we propose using the statistical properties of disparity,
which are derived from intermediate results computed during the image
matching process.

For every pixel in the reference image, Terrest searches along epipolar
lines in the target image for a match. At every candidate disparity
between the upper and lower bounds, Terrest generates a match score;
these scores are considered to be independent point estimates of a
smooth similarity function (SF).

About its peak, the SF is parameterized by a parabola with negative
curvature (the peak is the highest point). After computing the series of
match scores, a parabola is fitted to the data. If the curvature is
negative and the peak lies between the upper and lower bounds of the
search range, then the SF is considered well defined.
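The parabola fit and the test for a well defined similarity function might look like the sketch below. It assumes an ordinary least-squares fit over all sampled disparities and takes the smallest and largest sampled disparity as the bounds of the search range; the fitting procedure actually used by Terrest is not specified in the text, and the names `fit_similarity` and `SimilarityFit` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Sequence


class SimilarityFit(NamedTuple):
    peak_disparity: float
    peak_value: float
    curvature: float        # second derivative of the fitted parabola, 2a
    well_defined: bool


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_similarity(disparities: Sequence[float],
                   scores: Sequence[float]) -> Optional[SimilarityFit]:
    """Least-squares parabola s = a*d^2 + b*d + c through (disparity, score) samples."""
    n = len(disparities)
    if n < 3 or n != len(scores):
        return None
    s1 = sum(d for d in disparities)
    s2 = sum(d ** 2 for d in disparities)
    s3 = sum(d ** 3 for d in disparities)
    s4 = sum(d ** 4 for d in disparities)
    normal = [[s4, s3, s2], [s3, s2, s1], [s2, s1, float(n)]]
    rhs = [sum(d * d * s for d, s in zip(disparities, scores)),
           sum(d * s for d, s in zip(disparities, scores)),
           sum(scores)]
    det = _det3(normal)
    if abs(det) < 1e-12:
        return None

    coeffs: List[float] = []
    for col in range(3):                     # Cramer's rule
        replaced = [row[:] for row in normal]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(_det3(replaced) / det)
    a, b, c = coeffs
    if a == 0.0:
        return None                          # straight line, no extremum

    peak_d = -b / (2.0 * a)
    peak_v = c - b * b / (4.0 * a)
    in_range = min(disparities) <= peak_d <= max(disparities)
    return SimilarityFit(peak_d, peak_v, 2.0 * a, a < 0.0 and in_range)


# Synthetic match scores sampled from s = -(d - 2)^2 + 5.
good = fit_similarity([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 5.0, 4.0, 1.0])
# Scores that curve upward have no interior maximum.
bad = fit_similarity([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 4.0, 9.0, 16.0])
```

The fitted peak value and curvature are the per-pixel quantities from which windowed statistics can then be taken.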

Four 3D features (which are summarized in Table 1) based on the spatial
statistics of the fitted similarity function were computed. Details of
the generation of these 3D features can be found in (Wang, 1997a,b).

Experiments were conducted to test the value of 3D features in terrain
classification. Figure 3 shows the stereo pair used for the test. The
images were 2k x 2k sections extracted from aerial images of a rural
area of Fort Hood, TX. The cameras were tilted at 37° and 54°, and the
ground sampling distance of the reconstructed DEM was 0.75 m. Sample
images of the 3D features generated from this data set are shown in
Figure 4.


Feature | Description | Characteristics
--------|-------------|----------------
Peak | Peak value of the similarity function. | High values indicate well defined textures on flat surfaces (rocky ground). Low values are indicative of viewpoint dependent texture, e.g., branches and leaves in a tree canopy.
Variance | Variance of the similarity function peaks in a 17x17 window. | Low values are characteristic of uniform, well defined texture patterns. High values are associated with boundaries and man-made objects.
Curvature | Curvature of the similarity function about its peak. | Large values are associated with high contrast textures that vary over a few pixels. Smaller values are associated with blurry objects.
Density | Density of well defined similarity functions in a 17x17 window. | Poorly defined similarity functions are associated with areas of unreliable match scores, such as occluded objects and low contrast areas (e.g., paved roads and water).

Table 1. Definitions and characteristics of the 3D features generated from image matching.
Figure 3. Overlapping pair of images used to generate 2D and 3D
features.


In addition to the 3D features, a set of twelve standard 2D
co-occurrence features (Haralick et al., 1973) also were computed. Three
feature sets were analyzed: (A) the 12 co-occurrence features, plus
image intensities, (B) the 3D features (described in Table 1), plus
image intensities, and (C) the combination of A and B.

We selected a linear discriminant classification scheme based on the
Foley-Sammon transform with a minimum Mahalanobis distance selection
criterion (Foley, 1975). Four classes were considered in the
experiments, namely, foliage, grass, bare ground, and shadow. Very small
training data sets were used, which accounted for approximately 1% of
the total area covered in the ortho-image.

An analysis of the classification accuracy is shown in Table 2. Each row
of the table shows the class assignments (in terms of numbers of pixels)
for pixels of a known class. The accuracy measure was obtained by
summing the diagonal elements and dividing by the total number of
pixels. It can be seen that the 2D features resulted in the lowest
classification accuracy, followed by the 3D features, and then the
combined 2D and 3D features. For this experiment, the results show that
3D features significantly improve the classification ability.
Classification accuracy rises from 72.5 percent with the conventional 2D
set to 82 percent with the 3D features, with only minor improvement when
both 2D and 3D feature sets are included.
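The accuracy measure can be checked directly against the two blocks of Table 2 whose entries are all legible. The sketch below transcribes those entries (in units of 1000 pixels) and assumes the column order shadow, grass, foliage, bare ground, matching the row order; the function name `overall_accuracy` is an assumption of this example.

```python
from typing import List


def overall_accuracy(table: List[List[float]]) -> float:
    """Sum of the diagonal entries divided by the sum of all entries."""
    total = sum(sum(row) for row in table)
    if total <= 0.0:
        raise ValueError("contingency table is empty")
    return sum(table[i][i] for i in range(len(table))) / total


# Rows: ground truth. Columns: assigned class.
# Assumed order for both: shadow, grass, foliage, bare ground.
TABLE_2D = [
    [23.7, 7.0, 11.1, 0.0],
    [40.8, 332.6, 308.1, 1.6],
    [11.7, 55.0, 950.7, 1.2],
    [52.2, 12.9, 31.2, 97.2],
]
TABLE_2D_AND_3D = [
    [13.8, 0.0, 27.8, 0.2],
    [0.0, 468.6, 202.7, 11.8],
    [2.0, 18.0, 995.7, 2.8],
    [0.0, 33.9, 21.4, 138.1],
]

accuracy_2d = round(100.0 * overall_accuracy(TABLE_2D), 1)
accuracy_combined = round(100.0 * overall_accuracy(TABLE_2D_AND_3D), 1)
```

Under that assumed ordering, the computed values agree with the overall accuracies printed in the table for the 2D set and for the combined set.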

3  Conclusion 

We presented two improvements to 3D scene interpretation and analysis. A
computationally efficient adaptive windowing technique was tested in a
building reconstruction scenario, and was shown to significantly improve
3D model estimation in complex scenes.
Figure 5. Classification results using 2D and 3D features.

Ground truth | shadow | grass | foliage | bare ground
-------------|--------|-------|---------|------------
2D Features (overall accuracy: 72.5%)
shadow       | 23.7   | 7.0   | 11.1    | 0.0
grass        | 40.8   | 332.6 | 308.1   | 1.6
foliage      | 11.7   | 55.0  | 950.7   | 1.2
bare ground  | 52.2   | 12.9  | 31.2    | 97.2
3D Features (overall accuracy: 82.0%)
shadow       | 3.0    | [illegible] | 38.8  | 0.0
grass        | 0.0    | [illegible] | 231.4 | 12.2
foliage      | 0.4    | [illegible] | 995.3 | 2.9
bare ground  | 0.0    | 17.3  | 25.1    | 150.9
2D and 3D Features (overall accuracy: 83.4%)
shadow       | 13.8   | 0.0   | 27.8    | 0.2
grass        | 0.0    | 468.6 | 202.7   | 11.8
foliage      | 2.0    | 18.0  | 995.7   | 2.8
bare ground  | 0.0    | 33.9  | 21.4    | 138.1

Table 2. Contingency analysis of classification results (units in 1000 pixels).


In the second improvement, we derived a set of 3D
features from the intermediate results generated
during image matching. The classification ability of
the 3D features was tested on a rural scene containing
four classes (foliage, grass, bare ground, and shadow).
The results of the experiment showed that using 3D
features can significantly improve classification
accuracy.

Both of these enhancements have far-reaching
implications for efficient and robust reconstruction
and analysis of complex scenes. In future work these
algorithms will be incorporated into the automatic,
knowledge-driven systems Terrest and Ascender,
which are currently under development at UMass.
Abstract

The introduction of Interferometric Synthetic
Aperture Radar (IFSAR) to the Image Understanding
(IU) community provides researchers with a valuable
source of elevation data for many common IU tasks,
including site reconstruction. We show that the new
data can either be used to augment processing of
existing electro-optical (EO) images of a site or be
exploited as an independent data source. This paper
explores the initial application of existing IU
techniques to IFSAR images, develops new
reconstruction algorithms specifically for noisy
IFSAR data, and examines the feasibility of fusing
IFSAR and EO algorithms to arrive at a more reliable
site reconstruction.

Building regions are detected through segmentation
of the IFSAR data directly or through delineation of
rooftop boundaries in a registered EO image. A
top-down model indexing phase automatically
determines the rooftop shape of segmented regions.
Finally, robust parametric model fitting of the
selected rooftop determines the precise shape and
location of buildings at the site.

1  Introduction 

As part of the UMass effort in the APGD program,
there is an ongoing research effort to explore the
feasibility of using Interferometric Synthetic
Aperture Radar (IFSAR) as a primary data source for
IU tasks. In particular, we claim that IFSAR is a
valuable source of elevation data that can be used to
support 3D site reconstruction tasks [Chellappa et
al., 1996]. We have identified some of the
characteristics of the data that make the application
of IU algorithms to IFSAR a unique problem unto
itself. The straightforward application of existing
techniques to IFSAR imagery may fail, and numerical
methods such as surface fitting may have to be tuned
to the statistical properties of the data. Successful
exploitation of IFSAR by the IU community will
require both the development of new techniques and
the strategic application of old ones.

*This work was supported in part by DARPA under
contract DACA76-92-C-0041 (via TEC), by NSF grant
number CDA-8922572, and by Lockheed Martin under
Subcontract/PO number RRM072030.

2  A  Statistical  Profile  of  the  Data 

An analysis of the Sandia Kirtland AFB data set was
undertaken, and two properties of IFSAR relevant to
IU algorithms were noted and focused upon as part of
the study. The first such property is that the Sandia
Kirtland AFB data set had a large number of dropouts.
These are points within the image that contain no
data as a result of poor correlations during the
reconstruction of IFSAR from its two component
synthetic aperture radar (SAR) images [Ulaby et al.,
1981]. These dropouts clustered around building
edges, often completely occluding rooftop boundaries
(see Figure 1B). The second relevant property of the
data is a significant presence of outlier noise due to
errors inherent in the IFSAR construction process.
Table 1 lists the estimated inlier, outlier and
dropout percentages of the Kirtland data set. These
estimates were obtained by robustly fitting planes to
several rooftops known to be flat and analyzing the
residual errors of the fit.

Figure 1: Figures A and B illustrate the Kirtland AFB data set. Figure C shows the rooftop regions
found by the technique in Section 3.1, and Figure D is the final reconstruction yielded
by applying the model selector to the rooftops in C.

Table 1: Analysis of the Sandia Kirtland AFB
dataset.

Inliers     47.87%
Outliers    34.80%
Dropouts    17.33%
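
A profile of this kind could be computed along the following lines. This is a sketch under stated assumptions: dropouts are encoded as NaN, the robust estimator is a basic RANSAC plane fit (the text does not say which robust method was used), and `tol`, `iters` and the function names are hypothetical.

```python
import numpy as np

def ransac_plane(points, tol, iters=200, seed=0):
    """Fit z = a*x + b*y + c by RANSAC; return coefficients and inlier mask."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(points), size=3, replace=False)
        try:
            coef = np.linalg.solve(A[idx], points[idx, 2])
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) sample
        mask = np.abs(A @ coef - points[:, 2]) <= tol
        if mask.sum() > best_mask.sum():
            best_mask = mask
    # Refine on the consensus set with ordinary least squares.
    coef, *_ = np.linalg.lstsq(A[best_mask], points[best_mask, 2], rcond=None)
    return coef, np.abs(A @ coef - points[:, 2]) <= tol

def profile(elevation, tol):
    """Percent inliers, outliers and dropouts (NaN) in a flat-roof patch."""
    rows, cols = np.nonzero(~np.isnan(elevation))
    pts = np.column_stack([cols, rows, elevation[rows, cols]]).astype(float)
    _, inliers = ransac_plane(pts, tol)
    n = elevation.size
    return {
        "inliers": 100.0 * inliers.sum() / n,
        "outliers": 100.0 * (~inliers).sum() / n,
        "dropouts": 100.0 * np.isnan(elevation).sum() / n,
    }
```

The residual tolerance separates inliers from outliers, so the reported percentages depend on how it is chosen relative to the expected sensor noise.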

The striking frequency of outliers and dropouts
has motivated our use of the following IU
paradigms in site reconstruction tasks involving
IFSAR:

Multisensor Fusion The use of one or more
secondary data types can support IFSAR processing
in areas where it is weak, while still allowing the
exploitation of its strengths. In particular, we
focus on using EO along with IFSAR.

Robust Statistical Analysis Robust methods
are required in order to overcome outlier
noise and converge on correct model fits.

Top-Down,  Model  Based  Analysis  The  use 
of  building  models  when  applied  to  the  elevation 
data  provides  a  constraining  context  helpful  in 
the  interpretation  of  noisy  data  sets. 


It should be understood that the data analysis
has been based upon a particular dataset, the
Sandia Kirtland data. However, we believe the
approaches discussed here will still be valuable
and relevant to higher grade data. Note that the
reconstruction algorithms presented here were
designed with respect to the expected geometry
of buildings, not IFSAR-specific noise artifacts.
They are therefore generalizable to other sources
of DEM data, such as standard optical stereo.

3  Methods  for  Site  Reconstruction 

The site modeling task focused upon in this paper
is the 3D reconstruction of buildings from an
IFSAR image and, possibly, an EO image. The
reconstruction process is broken into two separate
stages. The first stage locates all the building
rooftops in the available data, and three
different approaches to this problem were studied.
Two of the approaches use a bottom-up,
region growing scheme to locate rooftops in an
IFSAR image, while the third produces polygonal
rooftop boundaries in a registered optical
image through the perceptual grouping of image
features, such as corners and lines, into closed
polygonal forms. The second stage of processing
then fits each of these rooftops, or focus of
attention regions, to a single shape primitive
selected from a database of such primitives to
yield the final reconstruction.
Figure 2: A cross-sectional view of the elevation data for a flat (top) and a peaked (bottom)
rooftop  and  their  corresponding  projections  into  an  elevation  histogram.  Note  that  flat 
surfaces  form  sharp  spikes  in  the  histogram  while  peaked  surfaces  form  plateaus. 


3.1 Bottom-Up IFSAR Processing

The first of the two bottom-up, region growing
strategies goes through two separate stages
of processing to locate building rooftops in the
image. The first stage of processing estimates
the likelihood that each point in the image
belongs to some rooftop in the scene. The key
spatial constraint is that a building rooftop will
appear at a higher elevation than most of the
ground immediately surrounding it. If we assume
that the entire site lies on a single, flat
ground plane whose normal is parallel to the
direction of gravity,* then there are sufficient
constraints to allow discrimination between
rooftop and ground points via an analysis of an
elevation histogram.

One such constraint is that, ideally, every point
on the ground will have the same elevation
value. Thus, the ground would appear as a unimodal
distribution in the elevation histogram of
the image. Here, a distribution is considered to
be unimodal if the number of votes in the bins
comprising that distribution decreases
monotonically with their distance from the
distribution’s peak (i.e. the bin with the most
votes). Thus, a unimodal distribution is not
necessarily a normal distribution in our case.
This avoids making strong independence assumptions
about the relationship between noise and elevation.
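
The unimodality definition above translates directly into a test on histogram bins. The sketch below is illustrative only; it assumes ties between neighbouring bins are tolerated (non-increasing rather than strictly decreasing counts), and the function name is hypothetical.

```python
def is_unimodal(bins):
    """True if vote counts never increase when moving away from the peak bin."""
    peak = max(range(len(bins)), key=bins.__getitem__)
    left_ok = all(bins[i] <= bins[i + 1] for i in range(peak))
    right_ok = all(bins[i] >= bins[i + 1] for i in range(peak, len(bins) - 1))
    return left_ok and right_ok
```

Note that a strongly skewed histogram such as `[9, 4, 1, 0]` passes this test, which reflects the point that unimodal does not mean normal.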

This ground “mode” will be associated with the
first significant distribution encountered in the
histogram and will be considered to begin at
the zeroth elevation bin in the histogram.
Furthermore, any distributions formed in the
histogram by points lying on building rooftops will
be assumed to be distinguishable from, and to
the right of, this ground mode (see Figure 2).
Rooftop points can therefore be segregated from
the ground by thresholding the image at the
elevation value corresponding to the valley
separating the first (ground) distribution from the
other (rooftop) distributions.

*Note that while this assumption at a global,
site-wide level is overly constraining, one can
reasonably expect it to hold locally, giving rise to
the adaptive thresholding approach discussed in
Section 4.

The valley best separating the ground mode
from other modes in the histogram is selected
via a simple fitness heuristic. A local minimum’s
fitness is based on the unimodality of the
distribution as defined by that local minimum and
the magnitude of the division, or bifurcation,
created in the histogram by that local minimum. A
local minimum is considered a bifurcation point
for a histogram since, ideally, each such minimum
would represent the bottom of the valley
lying between two different modes. These criteria
are used to avoid splitting at trivial minima
that correspond to artifacts of the image
noise, while still selecting a relatively unimodal
ground distribution.

The metric for measuring the magnitude of the
bifurcation created by a minimum, naturally
suggested by the monotonicity constraint for
unimodal distributions, is illustrated in Figure 3.
It is simply the amount that would have to be
“filled in” if the distributions on either side of
the minimum were to be merged into a single,
unimodal distribution. Analysis of the degree
of unimodality (i.e. the fitness heuristic
motivated by Figure 3) to remove local minima and
merge minor modes allows the selection of the
key, initial ground plane mode. Points are
assigned likelihood measures based on their
distance from the ground mode’s peak (i.e. their
distance from the ground plane). These likelihood
measures are used in the second stage to
classify planar patches into roof and ground
hypotheses. Next we discuss the process of finding
the planar patch decomposition.

Figure 3: The magnitude for the minimum V is simply the shaded area beneath the dashed line.
This is the total number of votes that the histogram would have to be changed by in
order to accommodate a merger between the modes on either side.
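
One plausible reading of the “fill in” magnitude is sketched below. It assumes the dashed line of Figure 3 is horizontal at the height of the lower of the two peaks flanking the valley, and that the threshold is placed at the first local minimum whose magnitude reaches a hypothetical `min_votes`; the unimodality part of the fitness heuristic is not modelled here.

```python
def bifurcation_magnitude(bins, valley):
    """Votes needed to fill the valley up to the lower of its flanking peaks."""
    left = valley
    while left > 0 and bins[left - 1] >= bins[left]:
        left -= 1                      # climb to the peak on the left
    right = valley
    while right < len(bins) - 1 and bins[right + 1] >= bins[right]:
        right += 1                     # climb to the peak on the right
    level = min(bins[left], bins[right])
    return sum(level - bins[i] for i in range(left, right + 1) if bins[i] < level)

def ground_threshold(bins, min_votes):
    """Index of the first local minimum with magnitude >= min_votes, else None."""
    for v in range(1, len(bins) - 1):
        is_minimum = bins[v - 1] > bins[v] <= bins[v + 1]
        if is_minimum and bifurcation_magnitude(bins, v) >= min_votes:
            return v
    return None
```

With the histogram `[1, 8, 7, 8, 3, 0, 2, 5, 1]`, the shallow notch at bin 2 has magnitude 1 and is skipped when `min_votes` is 5, while the valley at bin 5 has magnitude 10 and is selected.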

The second stage decomposes the image into a
component set of planar patches and then
classifies them into ground or rooftop hypotheses.
This segmentation is obtained by growing
outward from small seed regions until the error of
the best planar fit to that region exceeds an
accuracy parameter (this can be related to the
expected variance of the data). Seed regions
are small blocks of image points to which a
single plane can be fit below some residual error.
The best planar fit for a region being grown
is estimated via a generalized Hough transform
[Haralick and Shapiro, 1992] in a three-parameter
space that describes the surface normal of
a plane. This process of locating seed regions
and then growing outward from them is
iterated until every point in the image belongs to
one of these planar regions.
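
The growing step can be sketched as follows. For brevity the plane is estimated by ordinary least squares rather than the generalized Hough transform used in the text, the seed is a single pixel, and the error measure is the RMS residual; `max_rms` and the function names are assumptions made for illustration.

```python
import numpy as np
from collections import deque

def fit_plane(coords, z):
    """Least-squares plane z = a*col + b*row + c; returns (coef, RMS error)."""
    A = np.column_stack([coords[:, 1], coords[:, 0], np.ones(len(coords))])
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coef, float(np.sqrt(np.mean((A @ coef - z) ** 2)))

def grow_region(elev, seed, max_rms):
    """Grow a 4-connected region from seed while the planar fit stays good."""
    region, seen, frontier = [seed], {seed}, deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < elev.shape[0] and 0 <= nc < elev.shape[1]
            if not inside or (nr, nc) in seen:
                continue
            seen.add((nr, nc))
            trial = np.array(region + [(nr, nc)])
            if len(trial) >= 4:
                _, rms = fit_plane(trial, elev[trial[:, 0], trial[:, 1]])
                if rms > max_rms:
                    continue  # adding this point breaks the planar fit
            region.append((nr, nc))
            frontier.append((nr, nc))
    return region
```

Repeating this from new seeds among the unassigned points would yield the full decomposition; that outer loop is omitted here.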

After the image has been fully decomposed into
these patches, roof and ground assignments are
made to them based on the likelihood measures
of their component points (these are the
likelihood measures computed in the first stage).
Adjacent rooftop patches are then merged whenever
such a merger is likely to form a reasonable
roof shape (i.e. one that is not concave).
The resulting set of rooftop patches is filtered
on size relative to other patches, and the
remainder serves as our focus of attention
regions. The focus of attention regions located by
this technique in the Sandia Kirtland AFB data
set are shown in Figure 1C.

The second of the two bottom-up strategies also
decomposes an IFSAR image into a set of component
planar patches to facilitate rooftop detection,
but performs this segmentation in a
more sophisticated manner [Piater, 1996]. The
image is first decomposed into a set of (possibly
overlapping) circular tiles with a user-defined
diameter. The individual tiles are then fit to
planes to create an initial segmentation, after
which a split-and-merge phase is entered.

The problem of deriving an optimal splitting
and merging schedule for a given tiling is
combinatorial in nature and therefore requires the
use of an approximate method if a solution is
to be tractable. An iterative refinement approach
is taken, where, at each iteration, proposed
splits and merges are ranked by a locally
computed quality heuristic. The merger receiving
the highest rank is then instantiated at the
end of the iteration, and any other proposal
affected by this merger has its quality metric
updated. The quality metric for a proposed
merge is inversely proportional to the difference
between the two candidates’ surface normals
and the perpendicular distance between
them. These iterations are continued until the
global error of the planar fits fails to decrease
by some pre-specified amount. After the splitting
and merging is done, building rooftops are
extracted based on elevation.
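
The exact form of the merge quality metric is given in [Piater, 1996] and is not reproduced here. Purely as an illustration, the sketch below assumes the two terms (angle between normals and perpendicular plane-to-plane distance) are summed before inversion; that combination, the `eps` guard and the function name are hypothetical.

```python
import numpy as np

def merge_quality(n1, p1, n2, p2, eps=1e-6):
    """Higher when two planar patches have similar normals and lie close.

    n1, n2: unit surface normals; p1, p2: a point on each patch's plane
    (for example the patch centroids).
    """
    angle = np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0))
    # Perpendicular distance from patch 2's point to patch 1's plane.
    distance = abs(np.dot(n1, np.asarray(p2, float) - np.asarray(p1, float)))
    return 1.0 / (eps + angle + distance)
```

At each iteration the proposal with the largest score would be instantiated, and the scores of proposals touching the merged patches recomputed.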

3.2 Top-Down, Optical Processing

The third approach uses a registered optical image
to extract rooftop polygons via a perceptual
grouping scheme that delineates rooftop boundaries.
The rooftop hypotheses are projected into
the registered IFSAR data where they form focus
of attention regions. The rooftop classification
process described in the next section is
then applied to these regions to produce rooftop
models. Details on the optical image processing
can be found in [Jaynes, 1994].

3.3 Rooftop Shape Classification

Once possible rooftop regions have been extracted,
an automatic model indexing phase attempts
to classify their shape. The classification
scheme indexes into a database of surface
primitives based on an analysis of the differential
geometry of the elevation data within each
region. Indexing is based on estimating the
surface orientations of small surface patches, and
constructing an orientation histogram that is
then correlated with an existing library of roof
models. These orientation histograms, sometimes
called the Extended Gaussian Image, are
normalized so that they are both scale and
translation invariant.

For site reconstruction from IFSAR, the surface
primitive database contains a set of surface
classes, called surface primitives, that represent
possible rooftop shapes. Examples are planes,
cylindrical surfaces, peaks, and spires. Associated
with each surface primitive are a number of
models representing different parameterizations
of that class. For example, the “Peak” surface
primitive class is the canonical shape for a
number of models in the database, each with a
different peak angle. Corresponding orientation
histograms are stored for indexing purposes.
Figure 4 shows the database used for the results
shown here. It contains 8 different surface
primitives and 51 total models.

The elevation data within each region produced
by the system is triangulated into a simple
surface using the well-known Delaunay algorithm.
The triangulated surface is a set of oriented
triangular surface patches, whose vertices are
points from the original data set.

The surface normal at each triangular patch is
then computed and used to “vote” for a particular
cell on the Gaussian sphere. If the surface
normal, N, intersects the Gaussian sphere at
(x, y, z), the weighted vote is given by:

$$V(x, y, z, B) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-D^2 / (2\sigma^2)} \tag{1}$$

where D is the angular distance from (x, y, z)
to the center of the histogram bucket, B, to
receive the weighted vote. Voting for a given
vector stops when the bucket value of V(x, y, z, B)
falls below a threshold (0.1 for the results shown
here).
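
The voting step can be sketched as below. The sketch assumes the vote weight is a Gaussian function of the angular distance D with a hypothetical spread `sigma`, and it represents the histogram buckets by an array of unit vectors `bin_centers`; these names and the default values are assumptions, apart from the 0.1 threshold quoted above.

```python
import numpy as np

def vote(histogram, bin_centers, normal, sigma=0.2, threshold=0.1):
    """Add Gaussian-weighted votes for one surface normal.

    histogram:   float array of length K, updated in place.
    bin_centers: (K, 3) array of unit vectors, one per histogram bucket.
    """
    normal = normal / np.linalg.norm(normal)
    # Angular distance D from the normal to every bucket center.
    D = np.arccos(np.clip(bin_centers @ normal, -1.0, 1.0))
    weights = np.exp(-D**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    # Buckets whose vote would fall below the threshold receive nothing.
    histogram += np.where(weights >= threshold, weights, 0.0)
    return histogram
```

Calling `vote` once per triangular patch accumulates the orientation histogram for a region.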

To achieve model indexing, the constructed
histogram, referred to as the image histogram,
is then correlated with each of the model
histograms stored in the database. The scheme
selects the best model through a comparison
of the correlation scores and the orientation at
which the model should be fit. The top two
models are fit to the elevation data using the
parameters derived from the indexing process.
The model that has the lowest overall fit error
is retained for the final site reconstruction. For
a more complete treatment of this algorithm,
see [Jaynes, 1996]. Figure 1D depicts the
reconstruction yielded when the model selector was
applied to the focus of attention regions in
Figure 1C.
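
A minimal sketch of the indexing step follows. It uses a normalized cross-correlation score and returns the two best-scoring library models; the search over model orientation described above is omitted, and the score definition and function names are assumptions rather than the method of [Jaynes, 1996].

```python
import numpy as np

def correlation(h1, h2):
    """Normalized cross-correlation of two orientation histograms."""
    a = np.asarray(h1, float).ravel()
    b = np.asarray(h2, float).ravel()
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def top_models(image_hist, model_hists, k=2):
    """Names of the k library models best correlated with the image histogram."""
    scores = {name: correlation(image_hist, h) for name, h in model_hists.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Each returned candidate would then be fit to the elevation data, and the one with the lower fit error kept.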

4  Future  Directions 

Several techniques have been presented that are
relevant to the 3D reconstruction of building
sites from IFSAR, along with a loose framework
for their planned composition into a system. We
have also outlined several features desirable in
any IU system seeking to utilize IFSAR as a
primary data source. One of these is the
exploitation of multiple, overlapping sources of
evidence when making any judgment about the
data. This includes the use of data from multiple
sensors and the use of robust statistical
techniques.

This project is in the early stages of development,
and there are many possible extensions
and improvements to the approaches presented
here. Data from other sensor types, such as
SAR or multi-view stereo from optical data, may
be used to support processing. The use of an
adaptive thresholding window in the extraction
of rooftop points (Section 3.1) would allow for
piecewise-planar ground surfaces, a reasonable
assumption when dealing with building sites.
The addition of new surface primitives to the
model selector’s database (Section 3.3) will be
useful, provided that such models are separable.
A more important improvement is the extension
of our modeling capabilities to rooftops of
greater complexity through the ability to combine
multiple surface primitives in the fit of a
single rooftop region. Segmentation in the
elevation data would allow a single primitive to
be justifiably fit to each subregion; the fitted
subregions could then be merged into a single,
composite model. Such mergings can be accomplished
through the use of constructive solid geometry.

Figure 4: The shape primitive database used in modeling the rooftops detected in the IFSAR
image.
Abstract

A method for detection and description of rectangular
buildings from two or more registered aerial intensity images
is proposed. The output is a 3D description of the buildings,
with an associated confidence measure for each building.
Hierarchical perceptual grouping and matching across views is
employed to increase the robustness of the system. Verification
of selected building hypotheses is done using shadow
and wall evidence of the buildings. The system is largely
feature-based. Grouping and matching are performed in a
hierarchical manner, utilizing primitives of increasing
complexity, starting with line segments and junctions, and
proceeding to higher level features. Binocular and trinocular
epipolar constraints are used to reduce the search space for
matching features.

1  Introduction 

Detection and description of building structures from
aerial images is becoming increasingly important for a number
of applications such as map-making, change detection
and databases for simulators. This problem also offers an
opportunity to explore the issues of object segmentation and
3-D shape inference in a setting that is limited but still
poses significant challenges. While this problem has been
approached with use of just a single image ([5], [7]), multiple
images of the same scene are often available. In this paper, we
assume that two or more images are available, though they
may not be taken at the same time and so the imaging
conditions may be quite different.

The task of detecting and describing buildings presents
many challenges. In a single image, the object boundaries are
typically highly fragmented due to low contrast and to occlusion
caused by nearby vegetation and by smaller structures on the
roofs, and need to be grouped to yield the desired objects. In
our work, we limit the building shapes to be rectilinear (i.e.
rectangular or compositions of rectangular shapes) to aid the
task of organization. However, many other structures such as
roads, sidewalks and parking lots can also give rise to
rectilinear organizations and need to be distinguished from the
building structures. Availability of multiple images allows the
possibility of doing some of this reasoning in 3-D, by making
correspondences between the image features. This task too is
difficult in the aerial image domain. Area correlation methods
are likely to have difficulty as the viewpoints can be widely
separated, the images are taken at different times and the
building roofs have limited texture. In our system, we choose
to match features instead.

Figure 1 shows two views of a scene. Figure 2 shows the
line segments detected in the images in Figure 1. These views
illustrate some of the difficulties that arise. A large number of
line segments are detected but only a few correspond to
boundaries of desired structures. Many of the lines are parallel
to each other and hence difficult to match in the two views
unambiguously, without higher level context. Also, many
rectangular organizations of the features are possible if
fragmentation is allowed. In addition, poor camera calibration
prevents us from making highly accurate 3D position
inferences, which complicates the task of higher level
segmentation and description.

An important question in multiple image analysis is the
level at which image features should be corresponded. Lower
level features, such as edges, are easy to detect but are highly
ambiguous. Higher level features, such as surfaces, are easily
matched but hard to detect reliably in single images. Some
systems have been constructed to match features such as
junctions ([1], [10]) which are then used for grouping. Other
systems have attempted to find candidates for roof boundaries
and match them or verify them to get 3-D descriptions ([8],
[2]). We feel that matching at only one level does not fully
exploit the information available in the multiple images and that
rather than deciding between grouping first and then matching,
or matching first and then grouping, it is more advantageous
to interleave the two processes so that local features are
matched and then grouped to form higher level features in a
hierarchical way. While hierarchical approaches have been
suggested in the past (for example, [4], [6]) they have rarely
been implemented for scenes of the complexity considered here.

A block diagram of our approach is shown in Figure 3.
Our approach is to first form hypotheses for building roofs, as
roofs project into larger areas, and to verify the hypotheses by
using evidence from cast shadows and visible walls (if any).
As we consider only rectilinear buildings, a natural hierarchy
for hypothesis formation is that of lines, junctions, parallels,
U’s (three sides) and parallelograms (roofs project into
parallelograms since the imaging distance is large compared to the
height of the buildings). Matching at one level is used to form
group hypotheses at the next level. We maintain multiple
matches at each level and resolve them only when sufficient
information becomes available at the higher levels. We believe
that this approach not only provides good results for the
building detection task but also provides a model for more
general conditions.

Our  system  has  been  tested  on  a  number  of  real  images. 
The  details  of  our  system  and  some  results  are  shown  in  the 
following  sections. 


Figure  1  Two  views  of  a  scene 


2  Hierarchical  Grouping  and  Matching  of 
Features 

The system is built to handle two or more views
non-preferentially. A hierarchy of features is used in each
image. Starting from the most primitive, these are lines,
junctions, parallels, U’s and parallelograms (because the
projection of a rectangle is a parallelogram in general, assuming
negligible perspective effects). Grouping and matching is
performed at each stage. Below we explain which features in the
hierarchy were chosen for matching purposes.

2.1  Lines 

Lines detected using the Canny edge-detector are
matched across all views. Multiple matches are retained at
this stage. The constraint used in matching is the quadrilateral
constraint described in [9].


Figure  3  Block  Diagram  of  the  System 


2.2  Junctions 

Next, binary junctions (formed by the intersection of
exactly two lines) are formed and matched. The constraints for
junction matching are the epipolar constraint, the line-match
constraint, the 3D orthogonality constraint, the 3D height
constraint and the trinocular constraint. These are outlined in
[9].
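
Of these, the epipolar constraint has a standard formulation that is easy to illustrate. The sketch below assumes the relative geometry of two views is available as a 3x3 fundamental matrix `F` and accepts a junction pair when the second point lies within `tol` pixels of the epipolar line of the first; the representation, the tolerance and the function names are assumptions, since the actual test is described in [9].

```python
import numpy as np

def epipolar_distance(F, p1, p2):
    """Pixel distance from p2 (view 2) to the epipolar line of p1 (view 1)."""
    x1 = np.array([p1[0], p1[1], 1.0])
    x2 = np.array([p2[0], p2[1], 1.0])
    line = F @ x1                      # a*x + b*y + c = 0 in view 2
    return abs(x2 @ line) / np.hypot(line[0], line[1])

def satisfies_epipolar(F, p1, p2, tol=2.0):
    """True if the candidate junction match is consistent with the geometry."""
    return epipolar_distance(F, p1, p2) <= tol
```

For a rectified pair related by a purely horizontal translation, the epipolar lines are image rows, so matches are accepted only when the two junctions have nearly the same row coordinate.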

2.3 Parallels

Next, we search for parallels and their matches. Parallels
are formed between pairs of lines in the same view that are
separated by less than the maximum width of a building.
A match is hypothesized if there is evidence in at least two
views. The parallel match constraint described in [9] is used
to remove parallel matches that cannot lead to building
hypotheses that are planar in 3D.

2.4  U’s 

U’s are formed by an alignment of two junctions, or by a
parallel that has closure evidence near one of its ends. The
presence of a U is a strong indication of a fragment of a
parallelogram in 2D (implying a possible rectangular fragment
in 3D). [9] contains details of the implementation, and the
constraint applied at this stage.

2.5  Parallelograms 

Formation of parallelograms is the basis for hypothesizing
buildings. The existence of evidence to form a parallelogram
match is a strong indication that a rectangular 3D structure
exists.

3  Selection  of  Building  Hypotheses 

The parallelogram matches serve as roof hypotheses, and
are equivalent to having a 3D model of the buildings. Owing
to the resolution of the images, and the large errors in
triangulation from small errors in the images, additional
processing needs to be done to distinguish which hypotheses are
buildings or parts thereof, and which are rectangular areas on
the ground. This necessitates a selection procedure. The
selection procedure uses four criteria to decide which
hypotheses should remain for verification, namely the 3D height of
the building, positive and negative line evidence, and junction
evidence. Details are given in [9]. The results after the
selection procedure are shown in Figure 4.


Wall evidence: In a view which is not nadir, one or more
of the side walls of the buildings should be visible. These
walls are assumed to be vertical. The details of verification
for walls are included in [9].

Shadow  evidence:  When  shadow  lines  are  present,  they 
are  used  to  boost  the  confidence  of  the  hypothesis.  In  case  of 
the  Fort  Hood  images,  shadow  evidence  is  often  a  major  fac¬ 
tor  in  the  verification  of  a  selected  hypothesis. 

The evidence of shadows and walls is combined in a
Bayesian manner, with a priori probability estimates obtained
from the expected length of the shadow (wall) line.

The  combined  score  from  the  wall  evidence  and  shadow 
evidence  is  thresholded  to  obtain  rectangular  building  (or 
building  component,  in  the  case  of  non-rectangular  rectilinear 
buildings)  hypotheses. 

5  Combination  of  Rectangular  Buildings 

Rectilinear  buildings  can  be  decomposed  into  rectangu¬ 
lar  components.  Verified  rectangular  hypotheses  are  exam¬ 
ined  for  combination  according  to  two  mutually  exclusive  cri¬ 
teria:  proximity,  and  overlap.  The  precondition  for  both  crite¬ 
ria  is  that  the  hypotheses  be  of  approximately  the  same  height 
in  3D. 
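
The combination test described above can be sketched as follows. This is only an illustration under stated assumptions: roofs are taken to be axis-aligned rectangles in ground coordinates, and the `Roof` class, the tolerances `height_tol` and `max_gap`, and the grouping loop are hypothetical names and values that do not come from the system described here.

```python
from dataclasses import dataclass

@dataclass
class Roof:
    """Hypothetical verified rectangular hypothesis (axis-aligned for simplicity)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    height: float

def overlaps(a, b):
    """True if the two rectangles share interior area."""
    return a.xmin < b.xmax and b.xmin < a.xmax and a.ymin < b.ymax and b.ymin < a.ymax

def gap(a, b):
    """Distance between two rectangles (0 if they touch or overlap)."""
    dx = max(a.xmin - b.xmax, b.xmin - a.xmax, 0.0)
    dy = max(a.ymin - b.ymax, b.ymin - a.ymax, 0.0)
    return (dx * dx + dy * dy) ** 0.5

def can_combine(a, b, height_tol=1.0, max_gap=2.0):
    """Precondition: similar 3D height. Then either overlap or proximity."""
    if abs(a.height - b.height) > height_tol:
        return False
    return overlaps(a, b) or gap(a, b) <= max_gap

def group_components(roofs, **tolerances):
    """Repeatedly merge components that can be combined into rectilinear groups."""
    groups = []
    for roof in roofs:
        linked = [g for g in groups
                  if any(can_combine(roof, member, **tolerances) for member in g)]
        groups = [g for g in groups if not any(g is l for l in linked)]
        groups.append([roof] + [member for g in linked for member in g])
    return groups

# Two wings of an L-shaped building, one distant building, and one low
# rectangle that overlaps the first wing but differs too much in height.
groups = group_components([
    Roof(0, 0, 10, 4, 8.0),
    Roof(0, 4, 4, 10, 8.2),
    Roof(30, 30, 40, 40, 8.0),
    Roof(2, 2, 6, 6, 3.0),
])
group_sizes = sorted(len(g) for g in groups)
```

Only the two wings end up in the same group; the low rectangle stays separate because the height precondition fails even though it overlaps a wing.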

6  Results 


Figure  4  Selected  hypotheses  from  Figure  1 
Figure 5 Results after combination from Figure 1


4  Verification  of  Building  Hypotheses 

So far, only evidence concerning the roof has been used.
Lighting causes shadows to be cast, and when the view is
oblique, some of the vertical walls of the building may be
visible. These cues are used to verify the selected hypotheses
and to further reduce the number of hypotheses based on the
available evidence. The numerical evidence for the walls and
shadows is accumulated over all the views, and the average is
compared against a threshold. These monocular cues are
extremely important when the registration has errors large
enough to make height estimates unreliable. This is the case
with the Fort Hood images.


We have used images from Ft. Hood, Texas, for our
experiments. This dataset was acquired for the U.S. government
sponsored RADIUS program and has become a common
“benchmark” for evaluation of building detection systems.
This site is challenging because it has non-rectangular
buildings, vehicles are present on the roads and parking lots,
and it has trees and grassy areas. Real lighting conditions
cause shadows that are not necessarily the darkest areas in the
images. Furthermore, the acquisition geometry is such that the
epipolar lines between many pairs of views are almost parallel
(within 5°) to one of the sides of the buildings (in at least
one view) at the site. This causes height estimates to be less
reliable and the selection process less certain.

Figure 5 shows the results obtained from the images in
Figure 1. Both buildings in this example are described. The
example indicates how the combination routine may be
recursively applied to combine rectangular building components
into a rectilinear building.

Figure 6 shows L-shaped buildings with markings on the
roofs. The structural features are parallel, giving rise to a
number of parallel structures that the program has to
disambiguate. Figure 7 shows T-shaped buildings. There is
fairly dense vegetation close to the buildings, which obscures
shadow evidence in some areas. The small sections of the
building near the top edge of the image are not verified
because large sections are occluded by the shadows of the long
parts of the building. Figure 8 shows the results obtained for
fairly complex I-shaped buildings. Note that the view depicted
is fairly oblique. There is one false positive in this example,
caused by an accidental alignment of shadows and the side of
the building. Figure 9 shows a number of small, low buildings.
The height estimates for these buildings are extremely
unreliable (the buildings are less than 5m high), so shadows
and walls are extremely important in this case. Some of the
small buildings were not verified because there was not enough
edge evidence to support their selection. Figure 10 shows the
performance on extremely complex A-shaped buildings. Many
shadow and wall areas of rectangular components of these
buildings are occluded by other parts of the buildings. This
example shows that the system can work when a significant
amount of evidence is missing. Figure 11 shows two buildings
with wings. This illustrates the working of the system on
buildings with few features on the roof (which might create
problems for area-based systems). Figure 12 shows the results
on some multi-level and gabled-roof buildings. The multi-level
building is not perceived as a multi-level building, because
the difference in heights of the levels is too small to be
detected in the presence of the registration errors. Figure 13
shows results on a fairly complex area. Most of the buildings
are detected and modeled correctly, in spite of occlusion from
vegetation and poor shadows. Some errors or deficiencies can
also be seen. Some components of multi-wing buildings, such as
the component labeled A, are not detected because of missing
line evidence. The building labeled B is an instance of a
“true negative”. This building is missed because of occlusion
by shadows of the neighboring building, and the low height
of the building itself.

We have processed large areas of the “Motor Pool Area”
of the Fort Hood images, as shown in Figures 14, 15 and 16.
Figures 15 and 16 are reproduced at low resolution to show
the large sections. Figure 14 shows a sub-section at the
resolution at which data is processed. These results were
obtained by using the depicted view with one other overlapping
view. There are a number of multi-wing buildings, flanked by
smaller rectangular buildings. The rooftops of these buildings
are photometrically very similar to the ground. None of these
buildings is taller than 15m. In spite of these difficulties,
the system reliably finds the large buildings in areas where
the sides of the buildings are not highly fragmented owing to
the similar reflectance properties of the buildings and the
ground near them. It performs less reliably when the epipolar
lines are parallel to the sides of the buildings, as matching
these lines is harder than when the lines form a significant
angle with the epipolar lines. For example, the building
labeled C in Figure 14 is inaccurately modeled. This error is
caused by accidental background geometrical formations. Better
registration would permit higher-confidence 3D height
estimates, facilitating better selection.

Evaluation of the system is performed using quantitative
and qualitative criteria. A model is constructed by hand for
comparison. A building is declared detected if its roof area
overlaps more than 50% of a roof of a building in the supplied
model. Quantitative measures of the performance of the system
may be defined as follows: if $t_p$ is the number of true
positive hypotheses, $t_n$ is the number of true negative
hypotheses, and $f_p$ is the number of false positive
hypotheses, then we define the detection percentage as
$t_p/(t_p + t_n)$, and the branching factor as
$f_p/(t_p + f_p)$. For one part of the site from the Motor
Pool Area of Fort Hood, TX (shown in Figure 15), $t_p$ was 51,
$t_n$ was 11, and $f_p$ was 5. For another part of the site,
from the Motor Pool Area (shown in Figure 16), $t_p$ was 25,
$t_n$ was 7, and $f_p$ was 4. Measures of the number of pixels
that are correctly labeled as building and non-building pixels
are also useful. They are obtained by comparison with the
supplied model. These measures are shown in Table 1.
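
As a worked check of these definitions, the two measures can be computed directly from the counts. The function names below are illustrative; the argument `tn` follows the usage in the text, where a “true negative” appears to denote a building in the hand-built model that the system did not find.

```python
def detection_percentage(tp, tn):
    """Detection percentage tp / (tp + tn), expressed in percent."""
    return 100.0 * tp / (tp + tn)

def branching_factor(tp, fp):
    """Branching factor fp / (tp + fp)."""
    return fp / (tp + fp)

# Counts reported for the first Motor Pool section: tp = 51, tn = 11, fp = 5.
section1_detection = detection_percentage(51, 11)   # 51/62, about 82.26%
section1_branching = branching_factor(51, 5)        # 5/56, about 0.08929
```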

We are unable to compare our results directly with those of
other researchers, as we do not have access to their software.
We can compare to their previously published results; however,
these may not be on the same data, even when they have used
the Ft. Hood data set, and the published results are
necessarily outdated.


Figure 6 L-shaped buildings

Figure 7 T-shaped buildings

Figure 8 I-shaped buildings

Figure 9 Rows of small buildings

Figure 10 A-shaped buildings

Figure 11 Buildings with wings

Figure 12 Gabled roof and multi-level buildings

Figure 13 Results on a complex area of Fort Hood

Figure 14 Section of the Motor Pool Area of Fort Hood, TX

Figure 15 Section 1 of the Motor Pool Area from Fort Hood, TX

Figure 16 Section 2 of the Motor Pool Area from Fort Hood, TX
Table 1:

Section     Detection     Branching factor   Correct           Correct
            percentage    fp/(tp + fp)       building pixels   non-building pixels

Section 1   82.26%        0.08929            75.36%            99.13%
Section 2   78.13%        0.13333            71.84%            98.72%

We feel that the results we show, particularly in
Figure 13, are on much more complex examples, and that we
have processed a much larger number of buildings than reported
in the previous literature ([5], [2], and [10]). We feel that
our method has an advantage over monocular systems such as
[5] because it can infer 3-D more directly and has several
views with which to validate the hypotheses. We also feel that
our system can perform more consistently than systems such as
[2], which form hypotheses in one view and verify them in
others, because it does not depend on a favored view.
Abstract

A  novel  and  efficient  ridge  and  ravine  detection 
method  is  given. 

1  Introduction 

Ridges  and  ravines  are  important  features  in  some 
image  analysis  tasks  and  represent  a  basic  topo¬ 
graphic  type  in  digital  terrain  data.  Several  methods 
have  been  proposed  to  recover  these  features,  but 
they  have  major  shortcomings  including  (1)  their 
sensitivity,  and  (2)  their  computational  cost  (usu¬ 
ally  as  a  result  of  fitting  a  polynomial).  We  describe 
here an approach based on the Laplacian operator
that has a firm theoretical foundation and which is
relatively inexpensive to compute.

Haralick [Haralick and Shapiro, 1992] describes
how the facet model approach can be used to recover
ridges and ravines. A bicubic polynomial is
fit to a patch in the image; ridges are then characterized
by a negative second derivative across the
ridge line and a zero first derivative in the same
direction. The only difference for a ravine is that the
second derivative across the ravine is positive.
(Haralick’s book reviews several earlier techniques for
ridge and ravine detection; note that Rosenfeld and
Kak maintain that the Laplacian can be used to detect
lines.) The computational cost is high due to
the ten coefficients that are computed at each pixel.

A more recent technique related to our approach is
that proposed by Gauch and Pizer [Gauch and Pizer,
1993]. In their approach, they find places where
the “intensity falls off sharply in two opposite
directions.” They determine curvature extrema of the
level curves of the image in order to achieve this.
Unfortunately, their calculation requires the evaluation
of a large polynomial in the first-, second- and
third-order partial derivatives of the image, where
cubic splines are used to calculate the partial
derivatives.

This work was supported by the Defense Advanced Re¬

2  Curl  Method 

Our  method  is  based  on  the  following  sequence  of 
observations  concerning  the  behavior  of  the  gradi¬ 
ent  in  the  neighborhood  of  a  ridge  or  ravine.  Figure 
1  shows  an  image  as  a  surface  in  3D  (this  is  a  subim¬ 
age  of  a  medical  image  with  intensity  bands).  The 
gradient  (see  Figure  2)  produces  vectors  on  the  side 
of  a  ridge  which  point  toward  the  ridge  and  which 
point  away  from  a  ravine.  Although  the  gradient 
can  be  analyzed  directly  to  determine  the  location  of 
ridges  and  ravines,  it  is  computationally  more  con¬ 
venient  to  do  the  following: 

•  Rotate (locally) each gradient vector -90 degrees
about the out-of-image axis.

•  Calculate the curl at each point to determine
the opposed flow that exists at ridge lines.

•  Calculate the extremum of this function across
the ridge.

Figure 2: Gradient Vectors of Ridge Image

Figure 3: Rotated Gradient


Figure  3  shows  the  rotated  gradient  for  the  image  of 
Figure  1,  while  Figure  4  shows  the  extracted  ridge 
pixels. 

Now that the ideas should be clear, we give a formal
development of this technique. Let the image
function be $f(x,y)$. Then the gradient is:

$$\nabla f = f_x(x,y)\,\mathbf{i} + f_y(x,y)\,\mathbf{j} + 0\,\mathbf{k}$$

The rotation is:

$$\mathrm{rot}(\nabla f) = f_y(x,y)\,\mathbf{i} - f_x(x,y)\,\mathbf{j} + 0\,\mathbf{k}$$

The curl of this is:

$$\mathrm{curl}(\mathrm{rot}(\nabla f)) =
\begin{vmatrix}
\mathbf{i} & \mathbf{j} & \mathbf{k} \\
\frac{\partial}{\partial x} & \frac{\partial}{\partial y} & \frac{\partial}{\partial z} \\
f_y & -f_x & 0
\end{vmatrix}
= 0\,\mathbf{i} + 0\,\mathbf{j} + (-f_{xx} - f_{yy})\,\mathbf{k}$$

which is just the negative of the Laplacian.

Figure 4: Extrema of Curl across Ridge

Finally, a principal direction of curvature for a ridge
pixel is:

$$\alpha = \frac{\mathrm{atan2}(f_{xy},\, f_{yy} - f_{xx})}{2}$$

as well as $\alpha + \frac{\pi}{2}$. We search in these directions to
determine that the pixel is a local maximum across
the ridge. Note that ravines can be found in a similar
way, as they are (negative) minima. (Also, it is
possible that vectors of different strengths pointing
the same direction give rise to a response; they can
be easily filtered if necessary.)
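
The derivation can be checked numerically with a small sketch. It is deliberately dependency-free and uses simple central differences, which is an assumption of this example and not necessarily how the derivatives were computed for the figures; the helper names `ddx`, `ddy` and `curl_of_rotated_gradient` are illustrative.

```python
import math

def ddx(g):
    """Derivative along x (columns): central differences, one-sided at the border."""
    w = len(g[0])
    out = []
    for row in g:
        out.append([(row[min(c + 1, w - 1)] - row[max(c - 1, 0)])
                    / (min(c + 1, w - 1) - max(c - 1, 0)) for c in range(w)])
    return out

def ddy(g):
    """Derivative along y (rows), same scheme as ddx."""
    h = len(g)
    out = []
    for r in range(h):
        lo, hi = max(r - 1, 0), min(r + 1, h - 1)
        out.append([(g[hi][c] - g[lo][c]) / (hi - lo) for c in range(len(g[0]))])
    return out

def curl_of_rotated_gradient(f):
    """k component of curl(rot(grad f)), i.e. -(f_xx + f_yy)."""
    fx, fy = ddx(f), ddy(f)
    u = fy                                   # i component of the rotated gradient
    v = [[-a for a in row] for row in fx]    # j component of the rotated gradient
    dv_dx, du_dy = ddx(v), ddy(u)
    return [[dv_dx[r][c] - du_dy[r][c] for c in range(len(f[0]))]
            for r in range(len(f))]

# Synthetic horizontal ridge: a Gaussian profile centred on row 10.
image = [[math.exp(-((r - 10) ** 2) / 8.0) for _ in range(8)] for r in range(21)]
response = curl_of_rotated_gradient(image)
profile = [response[r][4] for r in range(21)]   # cut across the ridge
ridge_row = profile.index(max(profile))
```

On this synthetic image the response is positive on the ridge line and has its maximum across the ridge at the ridge row; negating the image turns the ridge into a ravine and the maximum into a (negative) minimum.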

3  Summary 

We  believe  that  the  curl  of  the  rotated  gradient  pro¬ 
vides  an  excellent  basis  upon  which  to  construct  a 
robust  ridge  and  ravine  detector.  It  is  comparable  to 
existing  operators,  but  much  less  costly.  We  have 
tried  this  technique  on  a  number  of  types  of  images 
and  found  the  results  to  be  very  good. 
Abstract

Current  methods  for  generating  terrain  databases  do 
a  poor  job  of  integrating  drainage  features  into  the 
terrain  skin,  and  in  arid  areas  leave  out  features  such 
as  ravines  and  dry  washes  altogether.  This  paper 
demonstrates  a  way  to  automatically  extract  ravines 
using  a  combination  of  hydrological  analysis  and 
image  understanding  techniques.  A  physically  real¬ 
istic  hydrological  analysis  is  run  on  30-meter  DEM 
data  to  generate  initial  estimates  of  ravine  loca¬ 
tions,  which  are  then  fine-tuned  by  an  active  contour 
(snake)  method.  The  approach  builds  on  standard 
tools,  rather  than  being  implemented  as  a  stand¬ 
alone  system. 

1  Introduction 

A  key  use  of  simulation  and  terrain  database  sys¬ 
tems  by  the  military  is  to  conduct  training  exercises 
and  to  plan  missions.  Currently,  however,  there  is 
a  notable  lack  of  micro-terrain  features  (features 
smaller  than  the  scale  of  the  terrain  source  data) 
which  can  have  a  large  impact  on  the  utility  of  the 
system.  In  this  paper,  we  describe  initial  experimen¬ 
tation  with  a  method  to  extract  ravine  micro-features 
from  data  comparable  to  that  available  in  the  NIMA 
data  inventory,  augmented  by  georegistered  aerial 
imagery. 

Our method is primarily designed to be used on
arid geographic regions, where ravines (aka “dry
washes” or “wadis”) occur due to erosion of the soil
during infrequent rainstorms. These ravines typically
occur as flat-bottomed corridors with steeply
rising sides which range from a fraction of a meter
to two to three meters high, though some can
be much deeper. As such, they are strategically
significant, as they can be used by dismounted infantry
as high-speed, concealed avenues for movement,
while movement across them can be particularly
difficult for mechanized vehicles.

Our primary method for extracting ravine micro-features
combines a hydrological analysis of the DEM (estimating
where water flows in the area) with a vision-based
analysis of the orthophoto (verifying where ravines
actually are). Since the washes are mostly cut by
water erosion, the hydrological analysis hypothesizes
likely wash locations.
However, the hydrological analysis alone is insufficient
to either confirm the actual existence of such
features or localize them accurately, because of the
restricted depth and small width of the ravines relative
to the resolution with which terrain elevation
is extracted. Similarly, while ravines are usually
at least partially visible on aerial photographs,
accurate detection and localization is difficult. The
one meter orthophoto contains “too much” information,
and ravines are easily confused with roads,
tracks and other structures commonly appearing in
a desert. However, reliable predictions of exact
ravine locations are possible by using the hydrological
analysis as an initial estimate, then modifying
this estimate using an active contour (snake) method
on the orthophoto. The combined approach detects
only the features of interest, while simultaneously
refining the estimated location. This can be done using
single views and does not require redoing complex
photogrammetric calculations.
2  Background

The  research  for  this  paper  is  based  upon  concepts 
in  GIS  and  hydrological  science,  and  concepts  in  the 
area  of  traditional  machine  vision. 

2.1  Hydrological  Analysis  for  Terrain 
Analysis 

The  earliest  ideas  on  using  DEM  data  to  find 
ravines  were  based  upon  using  local  surface  prop¬ 
erties  to  look  for  a  part  of  the  topographic  sur¬ 
face  that  is  locally  concave-upward,  and  mark 
this  position  as  a  valley  or  ravine,  presuming 
that  it  is  where  surface  water  runoff  is  likely  to 
be  concentrated  (e.g.,  [Peucker  and  Douglas,  1975, 
Chorowicz  et  al.,  1989,  Tribe,  1992]). 

Many researchers (e.g., [Mark, 1983,
Jenson and Domingue, 1988]) have used a method
that is more physically justifiable in nature. In
this method, a direction is assigned to each cell
of the DEM, corresponding to the direction that
water would flow out of that cell. This direction is
that of steepest descent (i.e., one of the 8 compass
directions that corresponds to the steepest downhill
slope from that cell). Given this “direction matrix”,
the total number of cells of the DEM that contribute
drainage through each cell is calculated. Those cells
that accumulate drainage above a certain threshold
are considered part of the drainage network.
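
The flow-direction and accumulation scheme just described can be sketched in a few lines. This is a simplified illustration, not the implementation in any of the cited papers: it assumes the DEM has no pits or flat areas (production tools typically fill these first), and the function names and the threshold of 9 cells are chosen only for this example.

```python
import math

# Offsets (d_row, d_col) of the 8 compass neighbours.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def flow_direction(dem):
    """For each cell, the neighbour reached by the steepest downhill slope (or None)."""
    rows, cols = len(dem), len(dem[0])
    direction = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best = 0.0
            for dr, dc in NEIGHBOURS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    # Diagonal neighbours are sqrt(2) cells away.
                    slope = (dem[r][c] - dem[rr][cc]) / math.hypot(dr, dc)
                    if slope > best:
                        best, direction[r][c] = slope, (rr, cc)
    return direction

def flow_accumulation(dem, direction):
    """Number of upstream cells that drain through each cell."""
    rows, cols = len(dem), len(dem[0])
    acc = [[0] * cols for _ in range(rows)]
    # Visit cells from highest to lowest so every upstream cell is done first.
    cells = sorted(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: dem[rc[0]][rc[1]], reverse=True)
    for r, c in cells:
        target = direction[r][c]
        if target is not None:
            acc[target[0]][target[1]] += acc[r][c] + 1
    return acc

# A 5x5 V-shaped valley along column 2 that slopes gently toward row 4.
dem = [[abs(c - 2) + 0.1 * (4 - r) for c in range(5)] for r in range(5)]
acc = flow_accumulation(dem, flow_direction(dem))
network = [(r, c) for r in range(5) for c in range(5) if acc[r][c] >= 9]
```

In this toy DEM every cell eventually drains to the outlet at the bottom of the valley, and thresholding the accumulation keeps only cells along the valley floor.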

2.2  Active  Contours  for  Image  Analysis 

Active contours (or snakes) were introduced in 1988
by Kass, Witkin, and Terzopoulos [Kass et al., 1988]
as energy-minimizing splines, whose energy is a
weighted sum of internal and external energies.
There are two different internal energies which may
be weighted in order to force a snake to act more
like a membrane or string, in the sense of it resisting
stretching, or more like a thin-plate or rod, in
that it resists bending [Leymarie and Levine, 1993].
The external energy equation is a function of the
image on which it is acting. This equation can
be specified to favor various image properties,
such as edges and lines. Snakes bring to bear
a high-level, global knowledge across the entire
curve, instead of relying solely on local, low-level
knowledge [Kass et al., 1988, Menet et al., 1990].
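
For a polyline, a common discretization of this weighted sum uses squared first differences for the stretching (membrane) term and squared second differences for the bending (thin-plate) term. The sketch below assumes that discretization; the weights `alpha` and `beta` and the `external` callable are illustrative names, not notation taken from the cited papers.

```python
def internal_energy(points, alpha, beta):
    """alpha * stretching + beta * bending for a polyline of (x, y) vertices."""
    stretch = sum((x1 - x0) ** 2 + (y1 - y0) ** 2
                  for (x0, y0), (x1, y1) in zip(points, points[1:]))
    bend = sum((x0 - 2 * x1 + x2) ** 2 + (y0 - 2 * y1 + y2) ** 2
               for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]))
    return alpha * stretch + beta * bend

def snake_energy(points, alpha, beta, external):
    """Total energy: internal terms plus the external energy sampled at each vertex."""
    return internal_energy(points, alpha, beta) + sum(external(x, y) for x, y in points)

straight = internal_energy([(0, 0), (1, 0), (2, 0)], 1.0, 1.0)   # no bending
bent = internal_energy([(0, 0), (1, 1), (2, 0)], 1.0, 1.0)       # stretched and bent
```

Raising `alpha` relative to `beta` makes the contour behave more like a string; raising `beta` makes it behave more like a rod.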

In recent years, much research has been directed
at various aspects of snakes, including
initialization (e.g., [Berger and Mohr, 1990,
Neuenschwander et al., 1994]), different underlying
representation of snakes (e.g.,
[Menet et al., 1990, Wang et al., 1996]),
formulation of energy functions (e.g.,
[Radeva et al., 1995, Lai and Chin, 1993]), imposing
constraints (e.g., [Neuenschwander et al., 1995,
Fua and Brechbuehler, 1995]), and the
method of solution (e.g., [Wang et al., 1996,
Amini et al., 1990]).

3  Approach 

In our research, we begin with a hydrological analysis
of a DEM to form estimates of ravine locations,
and then use active contours to attach to the true
location of the ravines. We have tested this approach
on data of the live fire range (Range 400) area of
the USMC Air Ground Combat Center in California,
covering a 3.3km by 2.2km area. Using Range
400 as a test area helps with evaluation, since terrain
information is available in both standard DTED formats,
and in a high-quality DEM with a 1m post
spacing and a relative vertical accuracy on the order
of 0.1 to 0.3 meters with a matching orthoimage at
the same resolution. The results presented here are
based on a 30m DEM produced by downsampling
the 1m DEM and quantizing elevation values to the
nearest meter in order to simulate typical low resolution
data. Results based on DTED level 2 data
will be presented in subsequent papers.

Figure 1 shows a flowchart depicting the process
used to determine ravine locations. Hydrological
and IU components provide separate sources of information,
which are combined using snakes to produce
a final estimate of ravine locations.

3.1  Hydrological  Analysis 

Hydrological analysis is done using standard tools
in the widely available Arc/Info geographic information
system (GIS). Localization of drainage patterns
is aided by resampling the 30m DEM at a 10m
resolution using the Arc/Info resample command
with cubic convolution (i.e., smoothing over a 4x4
patch). We then use the Arc/Info commands fill,
flowdirection, flowaccumulation, con,
and streamline to run a physically realistic
hydrological analysis on the 10m interpolated data, in
order to obtain our estimates for the location of the
ravines.

Figure 1: Flowchart of the ravine location process.
(Box labels recoverable from the figure: 1m orthophoto,
30m DEM, final contours.)


Figure  2:  Aerial  image  of  210m  by  240m  region  of 
the  Range  400  area. 


3.2  Active Contours

We use polyline (as opposed to B-spline) snakes,
solving the energy equations using the discrete dynamic
programming method introduced by Amini
[Amini et al., 1990]. In this method, local minima
are bypassed in favor of the global minimum for the
snake, and the snake is guaranteed to converge to
a final solution (a position in which it cannot be
modified and attain a lower energy) within a finite
number of iterations [Amini et al., 1990]. The
hydrological analysis provides the initial estimate for
the polyline snakes. The external energy function
is created using a potential function approach based
on edge elements computed from the aerial imagery,
rather than using image gradients directly. An edge
image is produced in which pixels corresponding
to all detected edge elements are set to a common
value. The edge image is then blurred using a Gaussian
kernel with a standard deviation proportional to
the expected uncertainty in the initial position estimates
from the hydrological analysis. This avoids
any need to merge edge elements into longer contours
and completely decouples the scale parameter
associated with edge detection from the scale parameter
specifying uncertainty in the snake initialization.

The hydrological analysis produces a tree-structured
network of flow contours. Snakes are used to refine
these contours by alternating between two optimization
processes:

•  Individual contours are optimized with respect
to the external energy function by holding their
endpoints fixed.

•  Junctions of contours are optimized with respect
to the external energy function by simultaneously
allowing the nearest 50m of each
contour associated with the junction to move.
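
The first of these two processes, refining a single contour with its endpoints held fixed, can be illustrated with a much simplified dynamic programming pass. The sketch is a stand-in and not the method of [Amini et al., 1990]: it keeps only a first-order (stretching) internal term so that a simple Viterbi-style recursion suffices, it lets each interior vertex move by at most one pixel per pass, and the `potential` callable merely stands in for the negated, Gaussian-blurred edge image. All names and the value of `alpha` are assumptions of this example.

```python
MOVES = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

def dist2(p, q):
    """Squared distance between two vertices."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def dp_pass(points, potential, alpha):
    """Best polyline with each interior vertex moved by at most one pixel."""
    n = len(points)
    # Endpoints have a single candidate (they stay fixed); others have nine.
    cands = [[p] if i in (0, n - 1) else [(p[0] + dx, p[1] + dy) for dx, dy in MOVES]
             for i, p in enumerate(points)]
    cost = [potential(*q) for q in cands[0]]
    back = []
    for i in range(1, n):
        step_cost, step_back = [], []
        for q in cands[i]:
            trans = [cost[j] + alpha * dist2(p, q) for j, p in enumerate(cands[i - 1])]
            j = min(range(len(trans)), key=trans.__getitem__)
            step_cost.append(trans[j] + potential(*q))
            step_back.append(j)
        cost = step_cost
        back.append(step_back)
    # Backtrack from the cheapest final state.
    j = min(range(len(cost)), key=cost.__getitem__)
    energy = cost[j]
    path = [cands[-1][j]]
    for i in range(n - 1, 0, -1):
        j = back[i - 1][j]
        path.append(cands[i - 1][j])
    path.reverse()
    return path, energy

def fit_snake(points, potential, alpha=1.0, max_iter=100):
    """Repeat passes until the energy stops decreasing."""
    path, energy = list(points), float("inf")
    for _ in range(max_iter):
        new_path, new_energy = dp_pass(path, potential, alpha)
        if new_energy >= energy:
            break
        path, energy = new_path, new_energy
    return path

# Toy potential with a valley along y = 3; the initial contour sags away from it.
valley = lambda x, y: (y - 3) ** 2
fitted = fit_snake([(0, 3), (2, 0), (4, 0), (6, 3)], valley, alpha=0.1)
```

Because the unchanged contour is always among the candidates, the energy cannot increase from one pass to the next, which mirrors the convergence argument given above.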

4  Results 

This  section  describes  some  preliminary  results  ob¬ 
tained  on  data  of  the  Range  400  area.  The  area  that 
is  shown  in  the  figures  is  a  210m  by  240m  segment 
of  the  Range  400  area. 

The image analysis was started using the 1m orthophoto
shown in Figure 2. A zero-crossing edge
detector was run on this image to produce the edges
in Figure 3. These were then blurred using a
Gaussian kernel with a standard deviation of 3.0.
This result is shown in Figure 4. The 30m DEM
data was interpolated to 10m posts in Arc/Info,
then run through the Arc/Info hydrological analysis
programs to arrive at the initial contour estimates
shown in Figure 5, shown here for purposes of
illustration overlaid onto a shaded relief image
produced from the 1m DEM data. Figure 6 depicts
the final ravine contours after being run through the
snaking programs for an average of 148 iterations
per contour.

Figure 3: Zero crossing edge detector applied to
image in Figure 2.

Figure 4: A Gaussian blur of the edge image, with
a standard deviation of 3.0.

Notice that the road has not been picked out as a
ravine, which an image analysis technique by itself
would most likely handle incorrectly. Notice also
that on the orthophoto there appear to be a number
of other objects that look like ravines, besides
the ravine that we find. Though these appear
on the orthophoto and may even be features
produced by the flow of water, they are not present
in the high-detail 1m DEM and are thus not of tactical
importance. Again, an image analysis technique
alone would not be able to distinguish between these
easily.

Figure 5: Initial ravine estimates based on hydrological
analysis of 10m DEM data.

Figure 6: The final contours after fitting with
“snakes”.

These results on a subsection of the Range 400 area
demonstrate the potential of using hydrological
techniques on low resolution DEM data to initialize
snakes running on high resolution orthophoto data.
We believe this combination provides a useful technique
for finding and localizing micro-terrain ravines
while avoiding the pitfalls (such as finding spurious
contours) of an image analysis technique alone.
Abstract

This paper presents a procedure for quantitatively
comparing the performance of two image understanding
algorithms using real images. We compare
each algorithm’s reconstructed points with the
“ground truth” object points acquired with a laser
3-D position digitizer. This procedure is necessary
because qualitative comparisons are not precise
enough to identify small errors and synthetic images
do not properly test algorithms that are intended for
use on real imagery. We outline the procedure we
use for comparing stereo reconstruction algorithms
and show an example.

1  Introduction 

Much discussion has been given over to the need for
ways to evaluate image understanding algorithms
and systems in an objective and quantitative manner
[Sawhney and Hanson, 1990, Weems et al., 1991,
Firschein et al., 1993, Holies et al., 1993]. A
few calibrated datasets are now available (e.g., see
[Willson and Shafer, 1992, Thompson and Owen,
1994, Owen et al., 1996]), but reports of either
quantitative evaluations of IU methods or the procedures
for performing such evaluations on real imagery
are still rare.

Informal, subjective measures of performance can
suffice if the manifestation of errors is large. If
the results “look wrong,” they probably are. Evaluation
is much harder when errors are more subtle
or when two methods with similar performance
need to be compared. Often, synthetic imagery is
used in such situations (e.g., [Barron et al., 1994]).
However, there are well known problems with using
synthetic imagery to evaluate IU methods that
are intended for use with real images. This paper
presents a procedure for determining the relative
performance of two vision algorithms. Our aim is
neither to draw broad conclusions about the specific
algorithms considered nor to deal in any depth with
the mathematical issues involved in quantifying error.
Rather, we simply want to show the engineering
steps necessary to provide the framework for such
activities.

Evaluating  image  understanding  methods  that  deter¬ 
mine  scene  geometry  requires  that  the  “true”  geom¬ 
etry  be  independently  known.  Further,  there  must 
be  an  effective  way  to  compare  this  representation 
of  geometry  with  that  produced  by  the  algorithm 
under  investigation.  Determining  the  true  scene  ge¬ 
ometry  involves  either  using  objects  of  known  shape 
and  accurately  measuring  the  position,  or  scanning 
the  scene  with  some  sensing  device  presumed  to  be 
significantly  more  accurate  than  the  vision  system 
being  tested  [Owen  et  al,  1996].  In  the  procedure 
presented  here,  a  laser  scanner  is  used  to  measure 
scene  geometry  in  order  to  evaluate  how  accurately 
two  stereo  reconstruction  algorithms  are  able  to  es¬ 
timate  depth. 

2  Gathering  Data 

Figure 1: Stereo image pair.

A set of test data consists of the image pairs needed
by the stereo algorithms and the ground truth to
compare against. Results from stereo algorithms are
often represented as a camera-centered depth map.
In order to directly compare the stereo results with
the true geometry, the ground truth should also be
rendered into a camera-centered depth map. Test
data  could  potentially  be  acquired  using  two  cam¬ 
eras  capable  of  sensing  both  intensity  and  range,  but 
such  devices  produce  imagery  that  differs  in  both 
geometry  and  photometry  from  standard  cameras. 
We  chose  to  use  conventional  cameras  and  cam¬ 
era  calibration  techniques,  augmented  with  the  use 
of  a  DIGIBOT  II  laser  3-D  position  digitizer.  The 
DIGIBOT  takes  precise  measurements  of  3-D  posi¬ 
tion  points  on  the  surface  of  objects.  The  objects  sit 
on  a  platter  which  the  DIGIBOT  can  rotate  in  order 
to  scan  the  objects  from  all  angles.  Such  a  capabil¬ 
ity  is  essential  if  ground  truth  depth  maps  are  to  be 
produced  for  each  camera  viewpoint. 

The  first  step  in  the  process  of  gathering  test  data 
requires  a  calibration  target  to  be  placed  within  the 
workspace  of  the  DIGIBOT  so  it  is  visible  from  both 
cameras.  The  calibration  target  we  use  is  a  dark 
plastic  cube  75  mm  on  a  side,  with  24  white  dots  in¬ 
set  around  the  outside  of  each  face,  centered  7.5  mm 
from  the  edge  and  10  mm  from  each  other.  An  im¬ 
age  of  the  cube  is  captured  from  each  camera,  and 
3-D  position  points  on  the  surface  of  the  cube  are 
scanned  by  the  DIGIBOT.  The  dots  in  the  images 
can  easily  be  located  to  sub-pixel  accuracy.  From 
the  location  of  the  dots  and  the  known  geometry  of 
the  cube,  we  calculate  the  camera  models  in  a  coor¬ 
dinate  system  attached  to  the  calibration  cube  using 
standard  methods  for  optically  calibrating  cameras 
[Tsai,  1987].  Using  the  DIGIBOT  data,  the  position 
of  the  planar  faces  of  the  cube  in  the  DIGIBOT  co¬ 
ordinate  system  can  be  determined  with  high  pre¬ 
cision.  Combining  these  results,  we  can  transform 
between  DIGIBOT  coordinates  and  camera  coordi¬ 
nates. 
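
The chaining of the two calibrations can be sketched as follows. This is only an illustration under stated assumptions: the camera model and the cube pose in DIGIBOT coordinates are taken as already estimated and expressed as 4x4 homogeneous rigid transforms, and all function and variable names are hypothetical rather than taken from the authors' software.

```python
# Sketch: map DIGIBOT points into camera coordinates via the calibration cube.
# Assumption: both inputs are 4x4 rigid transforms stored as nested lists.

def mat_mul(a, b):
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R | p]; the inverse is [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def apply(t, point):
    """Apply a 4x4 transform to a 3-D point."""
    x, y, z = point
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3]
                 for i in range(3))

def digibot_to_camera(cube_to_camera, cube_to_digibot):
    """Compose camera <- cube <- DIGIBOT."""
    return mat_mul(cube_to_camera, invert_rigid(cube_to_digibot))
```

Going through the cube frame means neither device has to observe the other directly; each only has to observe the cube.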

The  second  step  in  the  process  involves  replacing 
the  calibration  target  with  some  configuration  of 
objects  of  interest.  As  before,  stereo  images  are 
captured, and the scene objects are scanned by the
DIGIBOT. The scanned 3-D points are connected
into  a  surface  representation  using  a  triangulated 
mesh  specified  in  terms  of  the  DIGIBOT  coordinate 
system.  This  mesh  is  transformed  into  the  camera 
coordinate  system.  It  can  then  be  projected  into 
each  camera’s  image  plane  using  standard  rendering 
techniques,  except  that  in  this  case  it  is  the  z  values 
that  are  rendered  and  not  intensity.  The  result  is  a 
per-pixel  depth  map  for  each  camera. 
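
A much simplified version of this rendering step is sketched below. It is not the procedure described above: instead of rasterizing the triangulated mesh, it only projects individual points through an assumed ideal pinhole camera and keeps the nearest z value per pixel. The focal length in pixels, the image size, and the function name are illustrative assumptions.

```python
import math

def render_depth(points_cam, focal_px, width, height):
    """Z-buffer camera-frame points into a per-pixel depth map.

    Pixels that no point projects to keep the value math.inf,
    meaning that no ground truth is available there.
    """
    depth = [[math.inf] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0  # assumed principal point
    for x, y, z in points_cam:
        if z <= 0:
            continue  # point is behind the camera
        u = int(math.floor(focal_px * x / z + cx))
        v = int(math.floor(focal_px * y / z + cy))
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z  # keep the nearest surface
    return depth
```

A full implementation would interpolate z across each projected triangle so that the depth map has no holes between scanned points.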


Figure  2:  “Ground  truth”  range  images  for  left  and 
right  camera  views. 


3  Comparison  of  Stereo  Algorithms 

As  an  example,  we  compared  two  stereo  algorithms. 
Algorithm  A  used  the  method  described  in  [Smitley 
and  Bajcsy,  1984].  Algorithm  B  is  similar,  but  it 
skipped  some  steps  to  make  it  run  faster.  Using  the 
procedure  described  above  to  gather  test  data,  we 
compared  the  algorithms  to  see  how  much  quality  is 
lost  when  using  the  faster  algorithm.  The  stereo  im¬ 
ages  used  are  shown  in  Figure  1.  Each  stereo  algo¬ 
rithm  was  used  to  reconstruct  3-D  point  data  from 
the  images.  Though  we  have  not  done  so,  repeated 
experiments  could  be  used  to  determine  the  statisti¬ 
cal  significance  of  the  differences. 


Left image          Algorithm A   Algorithm B
bad matches                 804          1372
good matches              10138         10196
ave. error (mm)           1.496         1.510

Right image         Algorithm A   Algorithm B
bad matches                 844          1051
good matches              10093          9779
ave. error (mm)           1.487         1.494

Table 1: Performance of two stereo algorithms.
Good matches have less than 5 mm error.


Range  values  from  the  two  stereo  reconstructions 
were  compared,  on  a  per-pixel  basis,  to  the  range 
images constructed from the DIGIBOT scan (see
Figure 2), but only at those pixels for which the
stereo algorithm was able to determine a match. Reconstructed
depth values were classified as a good
match if they were within 5 mm of the true value and
a bad match otherwise. The quantitative RMS error
associated with the good matches and the number of
good and bad matches are reported in Table 1. The
distribution  of  good  and  bad  matches  is  shown  in 
Figure  3. 
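
The per-pixel comparison can be sketched as below. The sketch assumes that both depth maps are lists of rows in millimetres and that a missing value (no stereo match, or no ground truth) is stored as None; these conventions and the function name are ours, not the paper's.

```python
import math

def score_matches(stereo_depth, truth_depth, tol_mm=5.0):
    """Count good and bad matches and compute the RMS error of the good ones."""
    good, bad, good_sq = 0, 0, 0.0
    for s_row, t_row in zip(stereo_depth, truth_depth):
        for s, t in zip(s_row, t_row):
            if s is None or t is None:
                continue  # only score pixels where both values exist
            err = abs(s - t)
            if err < tol_mm:
                good += 1
                good_sq += err * err
            else:
                bad += 1
    rms = math.sqrt(good_sq / good) if good else None
    return good, bad, rms
```

Restricting the RMS error to good matches keeps a few gross mismatches from dominating the accuracy figure; the bad-match count reports them separately.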

Algorithm A is clearly better than Algorithm B (see
Table  1,  Figure  3,  and  Figure  4).  Not  only  did  Al¬ 
gorithm  A  have  fewer  bad  matches,  it  also  had  more 
good  matches  and  a  slightly  lower  average  error  for 
the  good  matches. 

When working to improve IU algorithms, it is useful
to  be  able  to  determine  when  one  version  of  an  al¬ 
gorithm  performs  even  slightly  better  than  another 
version  on  real  images.  The  procedure  described 
here  is  a  fairly  simple  way  to  do  such  comparisons. 


Figure 3: The location of good matches (gray) and
bad matches (black). Panels: Algorithm A, Algorithm B.

Abstract

This  paper  describes  plans  for  developing 
a model-based system for target recognition
in foliage. This approach is based
on  using  the  phenomenology  of  the  Fo¬ 
liage  Penetrating  (FOPEN)  sensor,  and  on 
resonance-based  features  for  recognition. 

This  paper  describes  the  key  components 
of  our  FOPEN  SAR  ATR  system,  the  re¬ 
search  problems  that  will  be  addressed,  and 
performance  evaluation  plans. 

1  Introduction 

We describe a joint research effort by the University
of Maryland (UMD) and DEMACO, aimed at
developing  a  model-based  ATR  system  for  recog¬ 
nition  of  targets  in  FOPEN  SAR  images.  Detec¬ 
tion  and  recognition  of  targets  in  FOPEN  SAR  is 
a  prime  example  of  ATR  in  very  low  signal/clutter 
ratio  situations.  Since  returns  from  clutter  can  be 
as  strong  as  those  from  targets,  and  the  clutter 
is  non-Gaussian,  the  detection  problem  poses  se¬ 
rious  challenges.  Also,  in  this  frequency  region, 
the  incident  wavelength  is  of  the  same  order  as 
the  size  of  the  body,  leading  to  a  resonant  be¬ 
havior  of  the  targets.  Attempts  to  recognize  tar¬ 
gets  in  foliage  have  used  one  or  another  form  of 
matching  to  templates  derived  from  target  reso¬ 
nances  [3,10,11,12,14,15].  Most  of  the  recent  target- 
resonance-based  approaches  have  been  bottom-up 
approaches  that  made  no  use  of  explicit  models 
or model-derived signatures. In programs such as
MSTAR, SAIP and RADIUS, explicit use of 3-D
models of sites and targets as well as sensor
phenomenology has played a key role in providing
context and model-based signatures that can
be subsequently used for focus of attention, clutter
mitigation and matching. Such model-based approaches
are sorely needed for the FOPEN SAR
ATR problem, as the phenomenology and very low
signal/clutter ratio make the problem even more difficult
when compared to the high-frequency SAR
case.

This project will be supported by the Defense Advanced
Research Projects Agency (ARPA Order No.
E656) and the U.S. Army Research Laboratory. For further
information see
http://www.cfar.umd.edu/~akunuri/FOPEN/fopen.html.

The  main  objective  of  this  project  is  to  study  a 
model-based  approach  to  recognition  of  targets  in 
foliage.  Specifically,  we  will  investigate  target- 
resonance-based  approaches  to  recognition.  A  key 
component  in  this  approach  is  the  creation  of  high- 
quality  synthetic  FOPEN  images  from  which  tem¬ 
plates  are  created.  Due  to  the  non-availability  of 
simulation  algorithms  for  FOPEN  signatures  that 
produce acceptable quality [14,15], templates generated
from exemplars are often used. DEMACO will
develop  new  techniques  for  synthetic  generation  of 
radar  images  at  low  frequencies  (less  than  1  GHz) 
and  demonstrate  the  utility  of  predicted  signatures 
in  detecting  and  recognizing  targets  in  foliage. 

Our  model-based  FOPEN  SAR  ATR  framework  con¬ 
sists  of  four  key  components:  Focus  of  Attention 
(FOA),  FOPEN  SAR  target  and  foliage  signature 
prediction,  feature  extraction,  and  recognition.  The 
FOA  module  includes  algorithms  for  segmentation, 
labeling  and  detection.  The  segmentation  and  la¬ 
beling  component  will  divide  the  FOPEN  SAR  im¬ 
ages  into  clear,  medium  and  dense  foliage  regions. 
The  detection  algorithm  will  adaptively  threshold 
the  early-time  response  depending  on  whether  the 
target  is  in  a  clear  area  or  in  foliage.  The  predic¬ 
tion  module  will  generate  synthetic  images  of  tar¬ 
gets  buried  in  foliage.  The  matching  module  will 
use  the  predicted  signatures  for  matching  image- 
derived  resonance-based  features  to  model-derived 
features.  Resonance  is  one  of  the  peculiar  features 
of  the  FOPEN  imaging  modality,  whereby  an  early 
optical  time  response  is  always  followed  down-range 
by a late-time resonant response which depends essentially
on the target dimensions.

One  of  the  much-studied  approaches  to  the  recogni¬ 
tion  of  targets  in  FOPEN  images  is  the  method  of 
resonance  extraction  known  as  the  singularity  ex¬ 
pansion  method  (SEM)  [3],  a  contemporary  vari¬ 
ation  of  Prony’s  method.  Since  SEM  results  in 
poor  performance  in  the  presence  of  noise  and  multi- 
path  effects,  spectral  correlation  methods  have  been 
proposed  [14,15].  In  this  approach,  synthetic  and 
image-derived  resonances  are  transformed  to  a  spec¬ 
tral  domain  (Fourier,  DCT  or  wavelet  basis)  and  cor¬ 
related.  This  approach  has  been  tested  on  canoni¬ 
cal  targets  as  well  as  images  containing  CUCVs  and 
other  confuser  targets.  Both  targets  in  the  open  and 
targets  in  foliage  have  been  considered. 
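
A minimal sketch of the spectral correlation idea follows. It assumes a Fourier basis, uses the DFT magnitude as the spectral representation, and scores similarity with a normalized inner product. Those specific choices, and the function names, are our assumptions for illustration; the methods in [14,15] may differ in the basis and in the correlation measure.

```python
import cmath
import math

def dft_magnitude(signal):
    """Magnitude of the discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def spectral_correlation(reference, observed):
    """Normalized correlation of two magnitude spectra (1.0 = same shape)."""
    fa, fb = dft_magnitude(reference), dft_magnitude(observed)
    num = sum(x * y for x, y in zip(fa, fb))
    den = math.sqrt(sum(x * x for x in fa) * sum(y * y for y in fb))
    return num / den if den else 0.0
```

Because the score is normalized, a uniformly attenuated copy of the reference still correlates perfectly, which is one reason a spectral score may tolerate foliage attenuation better than direct waveform comparison.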

One  of  the  key  issues  to  be  solved  in  this  approach  is 
the  generation  of  synthetic  reference  signals.  As  al¬ 
gorithms  for  predicting  resonances  for  complicated 
targets  are  not  yet  available,  reference  signatures 
have  been  generated  using  pole  patterns  through  the 
resonant  portions  of  the  CUCV  signatures  in  two  dif¬ 
ferent  images. 

Another aspect of resonance-based ATR for FOPEN
is  the  claim  that  target  resonances  are  invariant  to 
aspect.  This  is  only  approximately  true.  This  is  be¬ 
cause  the  locations  of  the  poles  in  frequency  space 
are  fully  invariant,  but  the  residues  are  a  function 
of  illumination,  which  in  turn  is  a  function  of  aspect 
angle  relative  to  the  radar.  If  several  targets  are 
in  the  foliage,  then  matching  using  only  one  tem¬ 
plate/target  may  not  be  adequate  due  to  clutter 
corruption  and  signal  attenuation.  New  matching 
techniques  for  the  multi-target  problem  need  to  be 
developed. 
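
The distinction between invariant poles and aspect-dependent residues can be made concrete with the usual damped-exponential model of a late-time response, s(t) = sum_k Re(R_k exp(s_k t)). The sketch below only synthesizes such a signal; the pole and residue values in the usage note are invented for illustration and do not describe any real target.

```python
import cmath

def late_time_response(poles, residues, times):
    """Evaluate s(t) = sum_k Re(R_k * exp(s_k * t)) at each time in `times`.

    poles:    complex s_k = damping + j*angular_frequency (damping < 0)
    residues: complex amplitudes R_k, one per pole
    """
    return [sum((r * cmath.exp(s * t)).real for s, r in zip(poles, residues))
            for t in times]
```

For example, calling the function twice with the same `poles` but residues `[1.0, 0.2]` and then `[0.3, 0.9]` gives two different waveforms that share the same resonant frequencies and decay rates. This is the sense in which the resonances are only approximately aspect-invariant.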

Our  planned  system  will  incorporate  the  following 
components: 

Prediction  of  low-frequency  SAR  signatures.  Until 
now  Xpatch  has  had  a  lower  frequency  limit  below 
which  it  was  not  effective.  The  existing  ray  tracing 
technique  in  Xpatch  will  be  improved  to  handle  fo¬ 
liage  scattering.  More  importantly,  DEMACO  will 
develop  novel  techniques  for  synthetic  generation  of 
target  and  foliage  signatures  at  low  frequencies,  us¬ 
ing  the  exact  Method  of  Moments  (MoM)  technique. 
This  component  alone  will  provide  new  approaches 
to  model-based  recognition  of  targets  in  foliage. 
Ongoing  SAR  ATR  efforts  at  UMD  heavily  utilize 
Xpatch-generated  signatures.  Xpatch  is  a  radar  sig¬ 
nature  prediction  code  developed  by  DEMACO  and 
is  based  on  a  high-frequency  asymptotic  method 
called  “shooting  and  bouncing  rays”.  For  comput¬ 
ing  SAR  images  of  military  targets,  Xpatch  has  been 
validated  for  the  frequency  range  of  S-band  and 
above. 


For  lower  frequencies,  however,  we  do  not  expect 
Xpatch  to  give  good  results  unless  it  is  supple¬ 
mented  by  new  techniques  suitable  for  describing 
low-frequency  resonant  phenomena.  Prediction  of 
SAR  images  of  targets  and  the  surrounding  clutter 
at  1  GHz  and  below  requires  both  high-  and  low- 
frequency  codes.  We  wish  to  compute  SAR  images 
of  ground  vehicles  in  a  clutter  environment  includ¬ 
ing  foliage  canopies.  In  terms  of  EM  wavelength, 
the  problem  has  two  scales:  large-scale  in  clutter 
and  small-scale  in  targets.  For  penetration  through 
canopies,  the  existing  ray-tracing  approach  is  sound 
as  long  as  proper  physics  associated  with  1  GHz 
wave  propagation  is  added  to  Xpatch.  The  net  ef¬ 
fect  of  the  canopy  is  to  distort  and  attenuate  the 
incoming  (exiting)  rays  from  the  radar.  Once  rays 
reach  the  ground  target,  a  different  phenomenon 
takes  over.  The  typical  size  of  a  ground  target  is 
10  meters  or  30  wavelengths  at  1  GHz.  Components 
on  the  target  that  account  for  the  return  in  a  SAR 
resolution  cell  are  on  the  order  of  one  wavelength. 
Hence  scattering  by  target  components  falls  into  the 
so-called  “resonance  region”,  which  cannot  be  ade¬ 
quately  described  by  the  high-frequency  techniques 
used  in  Xpatch.  Thus,  we  propose  to  attack  the 
low-frequency  target  scattering  problem  by  using  a 
new  exact  solver  code  called  Fast  Illinois  Solver  Code 
(FISC), which is based on the state-of-the-art fast
multipole  method.  Similarly  to  Xpatch,  FISC  com¬ 
putes  radar  signatures  and  images  from  a  complex 
target  represented  by  a  realistic  CAD  geometry  file. 
But  the  method  used  in  FISC  is  the  exact  MoM  in¬ 
stead  of  the  asymptotic  ray  tracing  technique. 

Model-based false alarm reduction. Despite our best
efforts  to  optimally  set  detection  thresholds,  a  large 
number  of  false  alarms  occur  due  to  the  very  low 
signal/clutter  ratio.  Current  attempts  at  false  alarm 
reduction  use  empirical  target  images  obtained  from 
other  collections.  We  intend  to  use  predicted  sig¬ 
natures  to  derive  target  attributes  that  can  be  ex¬ 
ploited  to  eliminate  non-target  clusters  which  often 
yield  false  alarms. 

Model-based  matching.  We  plan  to  exploit  the  pre¬ 
dicted  signatures  to  derive  templates  for  matching 
target  and  reference  features,  and  to  explore  spectral 
correlation  matching  and  other  techniques  based  on 
pole  profiles.  We  will  do  a  systematic  study  of  how 
quasi-aspect-invariance  of  target  resonances  will  im¬ 
pact  matching  complexity. 
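
The electrical-size argument made for the signature prediction component (a 10 m target at 1 GHz, with components about one wavelength in size) can be checked with a short calculation. The function names are ours; the only physical input is the free-space speed of light.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0

def wavelength_m(freq_hz):
    """Free-space wavelength in metres."""
    return SPEED_OF_LIGHT_M_S / freq_hz

def size_in_wavelengths(size_m, freq_hz):
    """Electrical size of an object: its physical size divided by the wavelength."""
    return size_m / wavelength_m(freq_hz)
```

At 1 GHz the wavelength is about 0.3 m, so a 10 m vehicle spans roughly 33 wavelengths, consistent with the rounded figure of 30 quoted in the text, while a 0.3 m component is about one wavelength across and therefore lies in the resonance region.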

Our long-term goal is to study a model-based approach
to the FOPEN SAR ATR problem. Fig. 1
shows the various modules including Focus of Attention,
feature extraction, prediction, indexing,
matching, and search [5]. The FOA module could
propose target chips to the feature extraction module.
Since low-frequency EM phenomenology is very
different, issues related to feature extraction are still
unresolved. Likewise, the prediction and matching
modules, although they will serve the same roles as
their counterparts in the MSTAR program, will be
significantly different owing to the low frequencies
employed and the foliage that must be taken into
account. Also, different types of invariance come into
play for features based on target resonances.

Figure 1: FOPEN SAR ATR system architecture

Our  FOA  module  will  be  based  on  our  research  ef¬ 
forts  with  the  Army  Research  Laboratory  in  the 
context  of  the  Federated  Laboratory  Advanced  Sen¬ 
sor  Consortium.  The  next  section  of  this  paper  de¬ 
scribes  the  various  system  components  and  open  re¬ 
search  issues,  and  discusses  our  performance  evalu¬ 
ation  plans. 


2  Open  Research  Issues  and 
Critical  System  Components 

UMD and DEMACO propose to undertake a systematic
study of IU approaches to FOPEN SAR.
Our  work  is  based  on  four  research  problems:  Focus 
of  attention  for  FOPEN  SAR  images,  prediction  of 
model-based  target  signatures  at  FOPEN  frequen¬ 
cies,  model-based  false  alarm  reduction,  and  match¬ 
ing. 


2.1  Prediction  of  low-frequency  SAR 
signatures 

DEMACO  will  undertake  tasks  involving  the  pre¬ 
diction  of  target  resonances  using  MoM.  By  using 
MoM, the EM problem is reduced to solving an N-by-N
matrix equation, where N is the number of unknown
current samples induced on the target. For
an  airplane,  N  is  typically  10,000  at  0.1  GHz,  and 
it  increases  with  the  frequency  squared,  which  leads 
to  two  important  conclusions:  (a)  Within  current 
computer  resources,  it  is  very  difficult  to  use  MoM 
for  military  targets  beyond  0.1  GHz  for  SAR  image 
simulations. (b) An efficient MoM-based code must
have a good algorithm for solving the matrix equations
rapidly for many frequencies.
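
The scaling behind conclusion (a) can be sketched numerically. The reference point (N = 10,000 at 0.1 GHz) and the frequency-squared growth come from the text; the 8 bytes per matrix entry (single-precision complex) and the function names are our assumptions.

```python
def unknowns(freq_ghz, n_ref=10_000, f_ref_ghz=0.1):
    """Scale the number of unknowns with frequency squared from a reference point."""
    return round(n_ref * (freq_ghz / f_ref_ghz) ** 2)

def dense_matrix_bytes(n, bytes_per_entry=8):
    """Storage for a full N-by-N matrix (8 bytes assumes single-precision complex)."""
    return n * n * bytes_per_entry

def lu_cost_ratio(n_new, n_old):
    """Relative O(N^3) cost of LU decomposition when N grows from n_old to n_new."""
    return (n_new / n_old) ** 3
```

Under these assumptions, going from 0.1 GHz to 1 GHz raises N from 10,000 to 1,000,000, the dense matrix storage by a factor of 10^4, and the LU decomposition cost by a factor of 10^6. For N = 25,000 the dense matrix alone needs about 5 GB.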

The standard algorithm for solving matrix equations
is LU Decomposition (LUD). LUD is an O(N^3) algorithm.
In 1990, Rokhlin [13] introduced the fast multipole
method (FMM) for solving matrix equations derived from
MoM (see also [6]). This was considered a major breakthrough in
low-frequency  techniques.  The  FMM  idea  is  to  first 
divide  the  subscatterers  into  groups.  Then,  the  addi¬ 
tion  theorem  is  used  to  translate  the  scattered  fields 
of  different  scattering  centers  within  a  group  into 
a  single  center  (aggregation).  Hence,  the  number 
of  scattering  centers  is  reduced.  Similarly,  for  each 
group,  the  field  scattered  by  all  the  other  group  cen¬ 
ters  can  be  first  “received”  by  the  group  center,  and 
then  “redistributed”  to  the  subscatterers  belonging 
to  the  group  (disaggregation). 

It has been proved that the computational cost of
two-level FMM is of order O(N^1.5) [6]. Numerical
simulations also show that the complexity is of order
O(N^1.5) [16,17]. A further improvement in FMM is
the  Multilevel  Fast  Multipole  Algorithm  (MLFMA) 
developed  at  the  University  of  Illinois.  To  imple¬ 
ment  MLFMA,  the  entire  object  is  first  enclosed  in 
a  large  cube,  which  is  partitioned  into  eight  smaller 
cubes.  Each  subcube  is  then  recursively  subdivided 
into  smaller  cubes  until  the  edge  length  of  the  finest 
cube  is  about  half  a  wavelength.  Cubes  at  all  levels 
are  indexed.  At  the  finest  level,  we  find  the  cube 
in  which  each  basis  function  resides  by  comparing 
the coordinates of the center of the basis function
with  those  of  the  center  of  the  cube.  We  further  find 
nonempty  cubes  by  sorting.  Only  nonempty  cubes 
are  recorded  using  tree-structured  data  at  all  levels 
[2,9].  Thus,  the  computational  cost  depends  only  on 
the  nonempty  cubes. 
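
The subdivision and bookkeeping described above can be sketched as follows. This is an illustration only: it uses a dictionary keyed by integer cube indices in place of the sorting step, and the function names and data layout are our assumptions rather than those of FISC.

```python
def finest_level(box_edge, wavelength):
    """Number of halvings needed until the cube edge is at most half a wavelength."""
    level, edge = 0, box_edge
    while edge > wavelength / 2.0:
        edge /= 2.0
        level += 1
    return level

def nonempty_cubes(centers, origin, box_edge, level):
    """Map each occupied cube index (ix, iy, iz) to the basis functions inside it."""
    n = 2 ** level
    edge = box_edge / n
    cubes = {}
    for idx, p in enumerate(centers):
        key = tuple(min(n - 1, max(0, int((p[a] - origin[a]) // edge)))
                    for a in range(3))
        cubes.setdefault(key, []).append(idx)
    return cubes

def occupied_by_level(cubes_finest, level):
    """Occupied cubes at every level; a parent index is the child index halved."""
    levels = {level: set(cubes_finest)}
    for lev in range(level, 0, -1):
        levels[lev - 1] = {(i // 2, j // 2, k // 2) for i, j, k in levels[lev]}
    return levels
```

Because only occupied cubes are stored, the work at each level grows with the number of basis functions on the target surface rather than with the volume of the enclosing cube.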

The basic algorithm for matrix-vector multiplication
is  broken  down  into  two  sweeps  [1].  The  first  sweep 
consists  of  constructing  outer  multipole  expansions 
for  each  nonempty  cube  at  all  levels.  The  second 
sweep  consists  of  constructing  local  multipole  expan¬ 
sions  contributed  from  the  well-separated  cubes  at 
all levels. When the cube becomes larger as one
progresses  from  the  finest  level  to  the  coarsest  level, 
the  number  of  multipole  expansions  should  increase. 
In  the  first  sweep,  the  outer  multipole  expansions 
are  computed  at  the  finest  level,  and  then  the  ex¬ 
pansions  for  larger  cubes  are  obtained  using  shifting 
and  interpolation. 

At  the  coarsest  level,  the  local  multipole  expan¬ 
sions  contributed  from  well-separated  cubes  are  then 
calculated  in  the  second  sweep.  The  local  expan¬ 
sions  for  smaller  cubes  include  the  contributions 
from  the  parent  cube  and  from  cubes  that  are  well- 
separated  at  this  level  but  not  well-separated  at  the 
parent  level  [4].  At  the  finest  level,  the  contribu¬ 
tions  from  non-well-separated  cubes  are  calculated 
directly.  Since  only  nonempty  cubes  are  considered, 
the complexity of MLFMA is reduced to O(N log N),
and the memory requirements for MLFMA are of
order O(N log N). Dembart and Yip [7,8] have implemented
MLFMA using radiation functions, interpolation
and filtering; this implementation has complexity
O(N log^2 N).

For  large  N,  the  advantage  of  MLFMA  over  LUD 
is  very  dramatic.  Let  us  illustrate  this  by  an  ex¬ 
ample.  Consider  an  airplane  whose  dimensions  are 
15x9x4  meters.  At  0.1  GHz,  a  matrix  equation  with 
N=25,000  is  set  up  by  MoM,  from  which  we  compute 
its  RCS  for  182  different  incidence  angles.  Results 
calculated  by  FISC  using  one  processor  on  an  SGI 
Challenge  (25  MFLOPS)  and  results  measured  by 
EMCC  (the  ElectroMagnetic  Code  Consortium)  are 
in  excellent  agreement.  The  memory  requirements 
and  speed  of  FISC  and  LUD  are  compared  in  the 
following  table: 

                 Speed (hrs)   Memory (MB)
FISC                      45           167
LUD                      400         5,200
Savings factor             9            31

We  cannot  find  a  computer  with  5.2  GB  of  mem¬ 
ory  without  going  to  a  very  large  supercomputer,  so 
the  above  data  for  LUD  are  only  an  estimate.  This 
example  illustrates  the  big  benefit  of  FISC,  namely, 
it  makes  MoM  practical  for  airplane  and  tank  scat¬ 
tering  problems  at  frequencies  below  1  GHz  using 
workstation-sized  machines. 

DEMACO  will  develop  methodologies  for  calculat¬ 
ing  SAR  images  of  realistic  ground  targets  in  clutter 
environments  at  1  GHz  and  below,  with  special  em¬ 
phasis  on  foliage  penetration.  The  approach  is  to 
use:  (a)  High-frequency  ray  tracing  techniques  in 
Xpatch  to  determine  attenuation  and  phase  distor¬ 
tion  of  incoming  (exiting)  radar  waves  due  to  clut¬ 
ter,  and  (b)  FISC,  low-frequency  MoM,  to  calculate 
target  returns. 


Specifically, we will consider the following problems:

Hybridize low- and high-frequency contributions. Let
us denote the scattering contribution from clutter by
A and that from the target by B. The total contribution
is not simply A+B. How to combine them
without  double-counting  or  missing  interactions  is  a 
challenging  research  problem.  This  problem  will  be 
addressed  in  this  task. 

Speed  up  frequency  looping  in  FISC.  Unlike  high- 
frequency  codes,  all  low-frequency  codes  are  ineffi¬ 
cient  in  repeated  frequency  calculations.  For  FISC, 
a  change  of  frequency,  however  slight,  is  a  new 
run.  Thus,  to  form  a  SAR  image  with  32  frequency 
points,  the  computational  cost  of  FISC  is  32  times 
that  of  a  single-frequency  run.  This  deficiency  must 
be  improved  before  FISC  can  be  used  for  SAR  ap¬ 
plications. 

Improve  foliage  scattering  in  Xpatch.  The  existing 
ray-tracing  technique  in  Xpatch  models  each  leaf 
and  branch  of  a  tree  individually.  This  is  not  feasi¬ 
ble  for  a  large  cluster  of  trees  or  for  low  frequencies. 
A statistical model based on random medium theory
and measurements will be researched and implemented
in Xpatch to support this development. This
overall  research  and  development  effort  will  be  a  ma¬ 
jor  advance  in  expanding  current  computational  ca¬ 
pabilities  across  a  very  wide  band  of  frequencies.  Un¬ 
til  now  Xpatch  has  had  a  low  frequency  limit  below 
which  it  was  not  effective.  With  the  advent  and  fur¬ 
ther  development  of  FISC,  coupled  with  Xpatch,  the 
new  computational  technology  will  provide  a  wider 
frequency  range  for  FOPEN  ATR  research  and  de¬ 
velopment. 

2.2  Matching  and  recognition 

Several  challenging  problems  remain  to  be  solved. 
As  mentioned  earlier,  target  resonances  form  the  ba¬ 
sic  attributes  that  will  be  used  in  recognition.  Poles 
extracted  from  target  resonances  have  been  histori¬ 
cally  used  [3,10,11,12]  as  features  for  identification. 
More  recently  [14,15],  using  a  projection  of  target 
resonances  onto  a  transform  basis  set,  a  set  of  coeffi¬ 
cients  known  as  spectral  templates  has  been  derived. 
The  FOPEN  image  data  are  also  projected  onto  the 
same  basis,  and  the  resulting  spectral  template  is 
correlated  with  the  target  template.  We  propose 
to  evaluate  the  pole  and  the  transform  projection 
techniques  mentioned  above  using  the  model-derived 
signatures.  In  addition  to  avoiding  the  necessity 
of deriving reference templates from other empirical
collections, the ability to use model-derived templates
allows us to rapidly test new target and foliage
models. Unlike the situation in FLIR and SAR
ATR problems where obscuration and concealment
are afterthoughts, in the problem under consideration,
targets are buried in foliage. The availability
of model-based prediction tools as envisioned in our
work enables us to account for the presence of foliage.
For example, it has been observed [15] that
in the presence of foliage clutter, in addition to substantial
clutter corruption, the target signature is
attenuated by several dB. Our matching techniques
will account for such signature changes.

Figure 2: Segmentation of a FOPEN image into clear and foliage regions

One  of  the  interesting  features  of  target  resonances 
is  that  they  are  quasi-aspect-independent.  In  the 
presence of clutter, since different amounts of corruption
and attenuation may occur depending on
the  nature  of  the  clutter  and  the  targets,  more  than 
one  template  per  target  may  be  required.  We  will 
systematically  study  this  issue  for  single  and  mul¬ 
tiple  target  cases  using  data  collected  by  ARL.  As 
pointed  out  in  [18],  several  questions  remain  to  be 
answered  in  FOPEN  ATR  design.  These  concern  the 
distinctiveness  of  pole  patterns  or  spectral  templates 
among  the  targets  under  consideration,  the  stability 
of resonances during operation, and the sensitivity
to polarization and multipath effects.

An  example  of  segmentation  into  clear  and  foliage 
areas  used  in  our  FOA  scheme  is  shown  in  Fig.  2. 
Fig.  3  shows  the  positions  of  detected  and  classified 
targets  using  a  spectral  matching  technique. 

2.3  Performance  evaluation 

The  evaluation  metrics  to  be  used  for  FOPEN  SAR 
ATR  are  the  probability  of  recognition  and  false 
alarm  rate  per  square  km  for  the  detection  modules. 
For  recognition  we  will  produce  confusion  matrices. 
The data collected by ARL researchers has ground
truth information on target types, locations, articulation,
and foliage, which will be useful for evaluating
the detection and recognition algorithms.

Figure 3: An example of target detection in FOPEN
images [14,15]
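
The two kinds of metric named above can be sketched as follows. The sketch assumes that the ground truth and the ATR output are available as parallel lists of class labels and that false alarms have already been counted against the ground truth locations; the function names and the labels used in the usage note are purely illustrative.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Return (classes, matrix); matrix[i][j] counts class i recognized as class j."""
    counts = Counter(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    return classes, [[counts[(t, p)] for p in classes] for t in classes]

def false_alarms_per_km2(num_false_alarms, area_km2):
    """False alarm rate normalized by the imaged ground area."""
    return num_false_alarms / area_km2
```

For example, true labels `["CUCV", "CUCV", "M-35"]` with predictions `["CUCV", "M-35", "M-35"]` give a matrix whose first row shows one CUCV recognized correctly and one confused with the M-35.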

The evaluation of MoM and other CEM codes
has been relatively straightforward because the
DoD/NASA Electromagnetic Code Consortium
(EMCC) has developed and continues to develop
better and better test cases. Using stereolithography,
an accurate CAD model for a VFY218 scale
model  aircraft  has  been  built.  The  EMCC  has  mea¬ 
sured  this  aircraft  in  an  anechoic  chamber  from  2 
to  35  GHz,  providing  for  an  equivalent  frequency 
range  of  70  MHz  to  approximately  1  GHz.  This 
wideband  data  is  very  useful  in  validating  the  FISC 
code.  In  addition,  this  geometry  is  a  benchmark  for 
EM  codes,  and  it  will  be  possible  to  compare  this 
code’s  results  to  other  codes  when  they  are  capa¬ 
ble  of  running  a  1  GHz  equivalent  test.  Currently 
FISC is the only code available that can run this
problem  on  a  workstation  in  reasonable  time.  In  ad¬ 
dition,  Wright  Laboratory  and  others  have  extensive 
chamber-measured  as  well  as  full-scale  vehicle  data 
on  the  M-35  truck;  we  will  utilize  this  data  for  the 
full  evaluation  of  the  code  for  ground  vehicles.  Also 
present  in  this  data  set  is  the  M-35  in  varying  fo¬ 
liage,  both  in  the  anechoic  chamber  as  well  as  at  full 
scale. 

3  Conclusion 

This  paper  describes  our  research  plans  whose  ulti¬ 
mate  goal  is  the  design  of  a  system  for  the  exploita¬ 
tion  of  FOPEN  SAR  imagery.  Our  system  will  em¬ 
phasize  model-based  detection  coupled  with  effective 
target  signature  prediction. 

Abstract

This  report  summarizes  our  research  plans  and 
direction for the above-named DARPA IU program.
The program begins in March 1997.
We  discuss  the  research  objectives,  the  research 
questions  to  be  considered,  and  our  performance 
evaluation  process. 

1  Objectives 

Our  program  is  aimed  at  developing  improved 
feature  extraction  for  application  in  feature- 
based  automatic  target  recognition  (ATR)  sys¬ 
tems.  Our  focus  is  on  synthetic  aperture  radar 
(SAR)  ATR,  although  our  methods  also  apply 
to  non-SAR  radar  ATR,  such  as  high  range  res¬ 
olution  radar  ATR. 

Our  primary  research  objectives  are: 

1.  Develop  and  validate  physically- 
based  models  for  scattering  that  can 
be  used  for  model-based  ATR.  Our 
work  is  aimed  at  developing  attributed 
scattering  center  models  for  radar  scatter¬ 
ing.  These  models  are  founded  on  electro¬ 
magnetic  scattering  theory  (uniform  the¬ 
ory  of  diffraction  and  physical  optics).  We 

’This  work  was  sponsored  by  the  Defense  Advanced 
Research  Projects  Agency  under  contract  F33615-97- 
1020  monitored  by  Wright  Laboratory.  The  views  and 
conclusions  contained  in  this  document  are  those  of  the 
authors  and  should  not  be  interpreted  as  representing  the 
official  policies,  either  expressed  or  implied,  of  the  De¬ 
fense  Advanced  Research  Projects  Agency  or  the  United 
States  Government. 


base  our  models  on  scattering  theory  to  en¬ 
sure  that  the  derived  models  are  physically 
meaningful. 

Our  approach  is  to  distill  from  these 
electromagnetic  scattering  theories  models 
that  balance  between  fidelity  and  simplic¬ 
ity.  The  electromagnetic  theories  contain
much  more  detail  of  scattering  behavior 
than  is  practical  for  ATR  applications.  Our 
goal  is  to  obtain  simplified  approximations 
that  contain  few  enough  parameters  to  be 
reliably  estimated  from  measured  data.  To
do  so,  we  build  on  our  past  successes  as  de¬ 
scribed,  for  example,  in  [Potter  and  Moses, 
1997]. 

2.  Develop  practical  feature  estimation 
algorithms.  The  aim  is  to  derive  es¬ 
timation  algorithms  that  can  be  imple¬ 
mented  in  SAR  ATR  systems.  Such  al¬ 
gorithms  should  be  computationally  prac¬ 
tical,  should  determine  model  order  au¬ 
tonomously,  and  should  be  “hands  off”  al¬ 
gorithms,  requiring  little  or  no  hand-tuning 
by  the  user. 

Our  effort  will  be  focused  on  developing 
algorithms  that  estimate  model  parame¬ 
ters  from  complex-valued  SAR  imagery  in 
the  image  domain.  This  is  in  contrast  to 
many  existing  parameter  estimation  meth¬ 
ods  that  operate  on  SAR  phase  history 
data.  Image  domain  processing  is  more 
practical,  because  existing  ATR  systems 
operate  on  SAR  image  “chips”  that  have 
been  prescreened  by  front-end  processing. 
In  addition,  we  will  consider  model  or¬
der  detection  and  image  segmentation  al¬ 
gorithms  needed  for  feature  extraction.  Fi¬ 
nally,  computational  speed  of  the  algo¬ 
rithms  is  of  importance;  we  will  aim  to 
develop  algorithms  with  good  statistical 
properties  while  minimizing  computation. 

3.  Measure  efficacy  of  these  features  for 
improvements  to  SAR  ATR  perfor¬ 
mance.  The  goal  is  to  quantify  the  ATR 
performance  improvement  afforded  by  our 
new  models  and  estimation  algorithms.  We 
will  quantify  improvement  first  by  deter¬ 
mining  the  uncertainty  of  the  estimated 
features.  Feature  uncertainty  is  needed  in 
Bayesian  evidence  accrual  used  for  scor¬
ing  of  candidate  target  hypotheses  to  the 
measured  target  data.  We  will  consider 
algorithm-independent  bounds,  such  as  the 
Cramer-Rao  bound,  as  well  as  algorithm- 
dependent  performance.  In  addition,  we 
will  quantify  ATR  detection  performance
using  the  feature  estimates  in
conjunction  with  feature-based  match  scoring.
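
To make the idea of an attributed scattering center concrete, the sketch below evaluates a simplified parametric response over frequency and aspect angle. It is an illustration under stated assumptions, not the model defined in this report: the functional form (a frequency-dependence exponent alpha and a position-dependent phase) follows the general style of the scattering-center literature cited above, the aspect-dependent amplitude terms are omitted, and the function and parameter names are hypothetical.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s


def scattering_response(freqs_hz, aspects_rad, centers, fc_hz):
    """Sum of localized scattering centers (simplified, assumed form).

    Each center is a tuple (amplitude, alpha, x_m, y_m), where alpha is the
    frequency-dependence exponent and (x_m, y_m) is the location in meters.
    Returns a complex array of shape (len(freqs_hz), len(aspects_rad)).
    """
    f = np.asarray(freqs_hz, dtype=float)[:, None]
    phi = np.asarray(aspects_rad, dtype=float)[None, :]
    total = np.zeros((f.shape[0], phi.shape[1]), dtype=complex)
    for amp, alpha, x, y in centers:
        freq_term = (1j * f / fc_hz) ** alpha
        # Two-way propagation phase for a scatterer at (x, y).
        phase = np.exp(-4j * np.pi * f / C * (x * np.cos(phi) + y * np.sin(phi)))
        total += amp * freq_term * phase
    return total


freqs = np.linspace(9.5e9, 10.5e9, 64)
aspects = np.deg2rad(np.linspace(-1.5, 1.5, 32))
centers = [(1.0, 0.0, 0.5, -0.2), (0.6, 1.0, -1.0, 0.8)]
response = scattering_response(freqs, aspects, centers, fc_hz=10e9)
print(response.shape)
```

Each center here carries four parameters rather than a location and amplitude alone, which is the sense in which the features are "attributed".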

The  research  program  builds  on  past  ATR  work 
funded  by  a  DARPA  University  Research  Initia¬
tive.  Our  current  technical  approach  is  based  on 
this  past  work,  as  summarized  in  [Potter  and 
Moses,  1997,  Ying,  1996,  Chiang,  1996]. 

Our  program  is  of  value  to  battlefield  awareness 
because  we  improve  SAR  ATR  performance  in 
the  following  ways: 

1.  Richer  feature  set  for  ATR:  Our 
physically-relevant  models  characterize 
scattering  using  multiple  attributes  for 
each  scattering  center;  this  gives  an  in¬ 
creased  level  of  target  discriminability, 
which  improves  ATR  performance.  Richer 
features  are  especially  important  in  ex¬ 
tended  operating  conditions,  when  fewer 
target  scattering  centers  are  observable. 

2.  Decreased  feature  uncertainty:  Be¬ 
cause  our  models  are  physically  based,  we 
are  able  to  exploit  prior  information  in  the
feature  extraction  stage  to  improve  resolu¬ 
tion.  We  can  thus  achieve  sub-pixel  accu¬ 
racy  and  superresolution  of  target  scatter¬ 
ing  phenomena.  The  smaller  the  feature 
uncertainty,  the  more  discriminable  are  tar¬ 
gets,  and  ATR  performance  improves. 

3.  Improved  match  scoring  metrics:  We 
will  develop  match  scoring  rules  that  are 
tailored  to  the  feature  extraction  algo¬ 
rithms  we  develop.  By  characterizing  fea¬ 
ture  uncertainty  and  coding  this  uncer¬ 
tainty  in  a  match  score,  we  will  further  in¬ 
crease  target  discriminability. 

The  above  performance  improvements  have  ap¬ 
plication  in  feature-based  ATR  systems,  such  as 
MSTAR.  We  are  closely  following  the  MSTAR 
program,  and  are  tailoring  the  research  program 
to  facilitate  insertion  of  our  developed  technol¬ 
ogy  into  MSTAR. 

2  Research  Questions 

In  order  to  achieve  the  above  research  objec¬ 
tives,  the  following  specific  research  questions 
will  be  addressed: 

1.  Determine  a  set  of  model  primitives 
that  balance  between  modeling  fi¬ 
delity  and  estimation  accuracy.  As 
we  mentioned,  electromagnetic  scattering 
models  are  often  too  detailed  to  be  of  prac¬ 
tical  use  for  ATR;  we  will  distill  from  this 
theory  models  that  contain  a  few  parame¬ 
ters  which  can  be  accurately  estimated,  and 
which  at  the  same  time  describe  scattering 
behavior  in  sufficient  detail  to  effectively 
discriminate  between  targets. 

We  will  characterize  achievable  estimation 
accuracy  as  a  function  of  system  parame¬ 
ters,  such  as  bandwidth,  center  frequency, 
and  signal-to-clutter  ratio.

During  the  February  MSTAR  interaction
meeting,  discussions  with  researchers  for
the  MSTAR  Predict  module  suggested  that
our  proposed  model  primitive  set  can  be
predicted  by  the  MSTAR  Predict  module.
This  will  facilitate  transfer  of  our  research
to  the  MSTAR  program.


2.  Develop  fast,  automated  computa¬ 
tion  for  complex  SAR  image-domain 
processing.  We  will  develop  image  do¬ 
main  algorithms  because  nearly  all  SAR 
ATR  processing  streams  operate  on  image 
chips.  We  will  develop  model  order  detec¬ 
tion  methods  that  are  effective  for  SAR 
scattering  features,  to  reduce  sensitivity 
of  performance  to  model  order.  We  will
develop  “hands-off”  algorithms  that  mini¬ 
mize  or  eliminate  user  tuning.  We  will  de¬ 
velop  algorithms  that  minimize  computa¬ 
tional  cost  while  maintaining  good  statisti¬ 
cal  feature  accuracy. 

3.  Implement  stand-alone  match  scor¬ 
ing  to  evaluate  target  discriminability 
and  feature  estimation  tradeoffs.  We 
will  develop  a  geometric  hashing-based  al¬ 
gorithm  for  match  scoring.  The  purpose 
is  twofold:  i)  to  consider  feature  extrac¬ 
tion  and  feature  match  scoring  in  a  tight 
loop  to  exploit  synergy  for  improved  ATR,
and  ii)  to  quantitatively  evaluate  feature 
extraction  performance  at  the  system  level 
as  target  detection  probabilities.  While  ge¬ 
ometric  hashing  is  effective  for  scattering 
center  locations,  a  research  question  is  how 
to  extend  the  procedure  for  other  scatter¬ 
ing  attributes. 

4.  Assess  the  potential  for  ATR  im¬ 
provement  via  superresolution  of  at¬ 
tributed  scattering  centers.  Some  ini¬ 
tial  results  in  superresolution  on  SAR  im¬
agery  by  Lincoln  Labs  and  ERIM  suggest 
that  enhanced  resolution  improves  target 
discriminability  and  hence  improves  ATR 
performance.  Superresolution  is  an  active 
question  of  interest  in  the  MSTAR  pro¬ 
gram,  to  name  one.  Our  attributed  scat¬ 
tering  center  features  naturally  offer  super¬ 
resolution,  both  in  sub-pixel  accuracy  of  es¬ 
timated  scattering  centers,  and  in  ability 
to  extract  two  (or  more)  scattering  centers 
within  the  same  resolution  “bin”.  We  will 
explore  the  gain  in  performance  afforded  by 
superresolution  in  our  attributed  scattering 
center  models. 
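
The geometric hashing approach mentioned in item 3 can be sketched for the simplest case, in which each feature is only a 2-D scattering center location. This is a minimal illustration under assumptions: the bin size, the similarity-invariant basis built from ordered point pairs, and all function names are choices made here, and extending the hash key to other scattering attributes is left open, as in the text.

```python
from collections import defaultdict
from itertools import permutations

import numpy as np


def basis_coords(points, i, j):
    """Coordinates of all points in the frame defined by the ordered pair (i, j).

    The frame is invariant to translation, rotation, and uniform scaling.
    """
    p0, p1 = points[i], points[j]
    origin = (p0 + p1) / 2.0
    ex = p1 - p0                      # x-axis along the basis pair
    ey = np.array([-ex[1], ex[0]])    # perpendicular y-axis
    norm2 = ex @ ex
    rel = points - origin
    return np.stack([rel @ ex / norm2, rel @ ey / norm2], axis=1)


def quantize(coord, bin_size):
    """Map a 2-D invariant coordinate to an integer hash-table key."""
    return tuple(int(v) for v in np.floor(coord / bin_size + 0.5))


def build_table(models, bin_size=0.25):
    """Offline stage: hash every model point under every ordered basis pair."""
    table = defaultdict(set)
    for name, pts in models.items():
        pts = np.asarray(pts, dtype=float)
        for i, j in permutations(range(len(pts)), 2):
            coords = basis_coords(pts, i, j)
            for k in range(len(pts)):
                if k not in (i, j):
                    table[quantize(coords[k], bin_size)].add((name, i, j))
    return table


def match_scores(table, scene, bin_size=0.25):
    """Online stage: best vote count per model over all scene basis pairs."""
    scene = np.asarray(scene, dtype=float)
    best = defaultdict(int)
    for i, j in permutations(range(len(scene)), 2):
        votes = defaultdict(int)
        coords = basis_coords(scene, i, j)
        for k in range(len(scene)):
            if k in (i, j):
                continue
            for entry in table.get(quantize(coords[k], bin_size), ()):
                votes[entry] += 1
        for (name, _, _), count in votes.items():
            best[name] = max(best[name], count)
    return dict(best)


models = {
    "A": [(0, 0), (4, 0), (4, 3), (1, 5), (-2, 2)],
    "B": [(0, 0), (1, 0), (2, 0), (3, 1), (5, 5)],
}
table = build_table(models)
# Scene: model A rotated by 90 degrees, scaled by 2, and translated.
scene = [(-2 * y + 10, 2 * x - 7) for x, y in models["A"]]
print(match_scores(table, scene))
```

The vote count plays the role of a raw match score; a score that weights votes by feature uncertainty would replace the hard bins with a likelihood term.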


3  Performance  Evaluation  Process 

Our  research  program  includes  a  plan  for  quan¬ 
titatively  measuring  performance.  Below  we 
summarize  this  plan  for  the  major  research 
goals. 

•  Model  validation:  The  goal  is  to  quan¬ 
tify  fidelity  of  attributed  scattering  center 
models.  We  will  use  target  scattering  pre¬ 
dictions  and  measurements  from  MSTAR 
of  geometric  primitive  data.  We  will  com¬ 
pare  percent  energy  in  our  extracted  model 
with  original  target  energy,  and  also  com¬ 
pare  detection  performance  of  scattering 
types  for  these  models. 

•  Estimation  accuracy:  The  goal  is  to 
quantify  estimation  accuracy  for  feature 
extraction.  We  will  first  use  scattering 
prediction  models  (MSTAR,  XPatch)  and
measurements  of  geometric  primitives.  We
will  corrupt  these  data  with  noise  and  ap¬
ply  our  feature  extraction  methods.  We
will  determine  the  mean  and  covariance  of  estima¬
tion  errors  and  ROC  curves  for  scattering
center  detection  versus  false  alarm  as  SNR
varies.  We  will  compare  these  values  with
our  theoretical  predictions  of  feature  un¬ 
certainty.  We  will  then  validate  perfor¬ 
mance  on  MSTAR  public  release  SAR  im¬ 
agery.  Finally,  we  will  obtain  quantitative 
target  classification  performance  (detection 
performance  ROC  curves;  confusion  matri¬ 
ces)  from  MSTAR  public  release  imagery 
when  used  in  conjunction  with  our  match 
scoring  metric. 

A  goal  of  this  program  is  to  port  our  fea¬ 
ture  extraction  algorithms  to  Khoros  mod¬ 
ules  for  use  in  other  programs;  thus,  an¬ 
other  measure  of  progress  is  completion  of 
modules. 

•  Superresolution:  The  goal  is  to  quantify 
ATR  performance  improvement  afforded 
by  superresolution.  We  will  quantify  sub¬ 
pixel  accuracy  of  scattering  center  location 
estimates  as  above.  We  will  also  quan¬ 
tify  classification  performance  at  the  out¬ 
put  of  our  match  scoring  metric.  Using  a 
combination  of  scattering  prediction  mod¬
els  at  higher  resolution  and  high-resolution 
measurements  as  “ground  truth”,  we  can 
quantify  attainable  sub-pixel  accuracy  by 
applying  feature  extraction  on  reduced- 
resolution  data. 
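
The estimation-accuracy procedure above (corrupt data with noise, estimate, then compare the error mean and covariance with a theoretical bound) can be sketched on a toy problem. The linear Gaussian model below is an assumption chosen because its Cramer-Rao bound has a closed form that least squares attains; it stands in for the nonlinear scattering-center estimation problem and is not the authors' evaluation code. All names are illustrative.

```python
import numpy as np


def crb_linear(H, sigma):
    """Cramer-Rao bound for theta in y = H @ theta + n, n ~ N(0, sigma^2 I)."""
    return sigma**2 * np.linalg.inv(H.T @ H)


def monte_carlo_errors(H, theta, sigma, trials=4000, seed=0):
    """Sample mean and covariance of least-squares estimation errors."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    errors = np.empty((trials, theta.size))
    for t in range(trials):
        y = H @ theta + sigma * rng.standard_normal(H.shape[0])
        estimate, *_ = np.linalg.lstsq(H, y, rcond=None)
        errors[t] = estimate - theta
    return errors.mean(axis=0), np.cov(errors, rowvar=False)


# Toy measurement model: offset plus slope observed at 50 sample points.
H = np.column_stack([np.ones(50), np.linspace(-1.0, 1.0, 50)])
bias, cov = monte_carlo_errors(H, theta=[1.0, 0.5], sigma=0.2)
bound = crb_linear(H, sigma=0.2)
print("empirical variances:", np.diag(cov))
print("CRB variances:      ", np.diag(bound))
```

Repeating the run over a range of sigma values gives the accuracy-versus-SNR curves described in the text.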

Abstract

This  project  focuses  on  the  complete  ATR  process¬ 
ing  chain,  from  sensor  signal  processing,  to  image 
analysis,  to  model-based  object  recognition.  It  uti¬ 
lizes  analysis  of  performance  characteristics  of  one 
component  to  model  input  characteristics  of  subse¬ 
quent  components,  and  leverages  the  constraints  of 
such  analysis  to  guide  the  development  of  the  entire 
chain  of  target  recognition.  The  overarching  theme 
of  our  approach  is  to  maintain  careful  statistical 
models  of  each  stage  in  conjunction  with  multires¬ 
olution  models  and  algorithms  to  derive  fundamen¬ 
tally  grounded  components  for  all  aspects  of  ATR 
processing. 

1  Introduction 

This  PI  Report  describes  work  conducted  under  a 
recently  completed  ATR  URI  grant,  as  well  as  work 
that  we  plan  to  conduct  under  a  newly  issued  grant 
as  part  of  DARPA  IU’s  IMEX/ATR  project.

1.1  Objectives  —  at  the  top  level 

Automatic  target  recognition  (ATR)  is  at  a  point 
where  it  must  soon  yield  significant  military  advan¬ 
tage.  The  demands  of  modern  warfare  make  it  a 
necessity.  We  can  ill  afford  to  wait  as  our  opponents 
maneuver  a  SCUD  into  position  or  prepare  an  unex¬ 
pected  attack.  We  must  be  able  to  quickly,  automat¬ 
ically  and  reliably  scan  an  entire  theater  of  operation 
to  detect  the  whereabouts  and  intentions  of  weapons 
and  forces.  Unfortunately,  even  trained  operators 
find  target  recognition  a  difficult  task.  Clearly  there 
is  no  shortage  of  information.  The  quality  and  quan¬ 
tity  of  SAR  imagery  is  advancing  steadily,  as  are  the 

°This  report  describes  research  supported  in  part
by  ARPA  under  ONR  contract  N00014-94-01-0994,
by  AFOSR  grant  F49620-93-1-0604,  and  by  DARPA
contract  95009-5381.  PIs  may  be  contacted  at
welg@ai.mit.edu,  jhs@mit.edu,  viola@ai.mit.edu,  and
willsky@mit.edu.  The  URL  for  the  project  is
http://www.ai.mit.edu/projects/darpa/atr/


vehicles  capable  of  using  it,  both  manned  and  un¬ 
manned.  Target  recognition  is  difficult  in  part  be¬ 
cause  of  this  unavoidably  large  volume  of  data.  As  a 
result,  information  on  the  target  comprises  a  vanish¬ 
ingly  small  part  of  the  totality  of  data,  perhaps  less 
than  one  part  in  ten  million.  To  make  matters  worse 
the  image  of  the  target  is  a  highly  complex  function 
of  the  sensor,  the  measurement  scenario  and  the  tar¬ 
get  itself.  ATR  is  a  search  for  a  needle  lost  in  a 
thousand  haystacks —  a  difficult  task  made  harder 
because  our  needle  looks  a  lot  like  a  piece  of  hay. 

Classically,  ATR  has  been  formulated  as  a  sequen¬ 
tial  process.  The  first  step  is  the  processing  of  raw 
sensory  data,  followed  by  computations  that  form 
images  from  this  data.  From  these  images  primi¬ 
tive  features  are  detected,  and  finally  a  target  recog¬ 
nition  algorithm  is  applied  to  these  features.  Each 
step  in  this  process  has  become  a  separate  area  of  re¬ 
search,  between  which  there  is  little  communication. 
For  example,  SAR  based  target  detection  strives  to 
be  independent  of  the  processing  used  to  form  SAR 
images  from  raw  data.  In  our  view,  revolutionary 
advances  in  ATR  require  more  than  enhanced  algo¬ 
rithms  for  each  of  the  component  functions  in  an 
ATR  system.  Rather,  it  requires  the  establishment 
of  a  unified  theory  of  ATR,  in  which  the  physics  of 
sensor  phenomenologies,  the  tools  of  statistical  esti¬ 
mation,  and  technology  from  image  understanding 
(IU)  are  combined.  Such  an  approach  promises  to
(a)  make  optimal  use  of  the  information  embedded 
in  sensor  data;  (b)  yield  algorithms  that  are  robust 
to  the  uncertainties  and  variabilities  in  both  phe¬ 
nomena  and  data;  and  (c)  provide  a  clear  audit  trail 
for  the  evaluation  of  performance  and  for  pinpoint¬ 
ing  the  factors  limiting  performance  (e.g.,  is  a  better 
matching  algorithm  needed  or  is  the  problem  simply 
that  the  data  are  not  of  the  quality  needed  to  meet 
specific  performance  objectives). 

Our  previous  collaborative  research  under  the  cur¬ 
rent  ATR-URI  program  has  resulted  in  a  number 
of  accomplishments  and  outputs  along  these  lines. 
Among  these  are  a  variety  of  algorithms,  several 
transitions  and  applications  of  direct  relevance  to 
DARPA  programs,  and  the  establishment  of  very 
close  working  relationships  with  a  number  of  key
players  in  ATR  programs  and  ACTD’s.  However, 
of  equal  (and  perhaps  greater)  importance  is  the 
fact  that  our  experience  in  this  research  effort  has 
both  strengthened  our  belief  that  an  integrated  ap¬ 
proach  to  ATR  has  a  considerable  amount  of  promise 
and  also  sharpened  our  understanding  of  how  that 
promise  can  be  realized.  Our  research  builds  on  this 
foundation  and  offers  what  we  believe  is  a  cohesive 
set  of  research  ideas  that  involve  the  integration  of 
disciplines  spanning  the  entire  ATR  problem.  Per¬ 
haps  the  strongest  message  that  we  have  extracted 
from  our  previous  work  is  that  when  one  takes  a 
unified  look  at  an  ATR  problem,  the  challenges  pre¬ 
sented  at  each  stage  of  the  processing  chain  are  in¬
fluenced  significantly  by  that  stage’s  placement  in  context,
leading  to  new  problem  formulations  and  conceptual 
issues  that  have  not  been  considered  before.  Our 
objective  is  to  capitalize  on  these,  both  to  make  sig¬ 
nificant  contributions  in  domains  corresponding  to 
each  of  these  processing  steps  and  to  produce  new 
end-to-end  processing  structures. 

1.2  Objectives  -  detailed  examples 

While  our  current  research  suggests  a  variety  of 
problems  worthy  of  attention,  we  can  point  to  two 
specific  contexts  that  exemplify  both  the  contribu¬ 
tions  and  foundation  that  our  current  effort  has  pro¬ 
vided  and  some  very  concrete  new  problem  formu¬ 
lations  that  result  from  this  unified  view: 

1.  Multiresolution  SAR-based  ATR.  One  of 
our  major  successes  to  date  under  the  current  ATR- 
URI  project  is  in  developing  multiresolution  stochas¬ 
tic  models  for  SAR  imagery  that  accurately  capture 
the  scale-to-scale  statistical  variability  of  speckle  in 
SAR  imagery.  In  one  application,  we  used  models 
for  natural  clutter  and  for  man-made  objects  to¬ 
gether  with  our  fast  statistical  likelihood  calculation 
methods  to  develop  an  enhanced  discrimination  fea¬ 
ture  that,  when  integrated  into  Lincoln  Labs’  ATR 
algorithm  and  tested  on  a  very  large  data  set,  re¬ 
sulted  in  a  factor  of  6  reduction  in  false  alarms  over 
the  previous  best  results.  In  the  second  applica¬ 
tion,  we  used  our  models  to  segment  natural  clutter 
(trees  and  grass)  and  to  enhance  anomalous  pixels 
(due  to  man-made  scatterers)  that  did  not  produce 
the  scale-to-scale  variability  consistent  with  natural 
clutter.  The  results  are  very  accurate  segmentations 
and  enhanced  visibility  of  anomalies  as  compared  to 
widely  used  CFAR  methods  (see  Figure  1). 

We  believe  there  are  many  additional  applications 
for  these  multiscale  models.  For  example,  anoma¬ 
lies  that  result  from  man-made  objects  exhibit  them¬ 
selves  as  distinctive  patterns  across  scale  that  differ 


Figure  1:  Example  of  multi-resolution  segmentation,  by 
recursively  subdividing  residual  areas  based  on  a  multi¬ 
scale  model  of  statistical  variations  in  the  SAR  data. 

significantly  from  the  scale-to-scale  textural  varia¬ 
tions.  Consequently,  chains  of  pixels  across  scale 
could  in  principle  be  viewed  as  robust  and  statis¬ 
tically  meaningful  features  that  can  be  further  ex¬ 
ploited  for  model-based  recognition.  We  propose  to 
use  multiscale  features  for  higher  level  recognition 
and  reasoning.  Classically,  target  models  include  ge¬ 
ometrical  constraints  on  the  appearance  of  features 
in  space.  In  this  new  framework,  models  will  also  in¬ 
clude  information  about  the  appearance  of  features 
across  scale.  The  development  of  such  models  is  a 
central  objective  of  this  project.  Once  we  have  such 
models,  we  can  use  our  statistically  optimal  methods 
for  evaluating  likelihoods  to  evaluate  match  scores 
for  hypothesized  models  and  poses. 

2.  SAR-based  ATR  incorporating  pose- 
dependent  SAR  image  formation  and  anal¬ 
ysis.  Scattering  patterns  for  man-made  scatterers 
possess  very  different  characteristics  from  those  of 
natural  scatterers.  In  particular,  while  the  latter  fre¬ 
quently  can  be  modeled  as  diffuse  isotropic  scatter¬ 
ers,  the  former  frequently  have  strong  specular  char¬ 
acteristics,  which  implies  that  they  have  extremely 
strong  aspect-dependent  responses.  This  effect  is 
even  more  pronounced  in  low-frequency  SAR  as  is 
used  for  foliage  penetration. 

While  some  work  has  attempted  to  account  for  the 
difference  in  scatterer  type,  it  is  fair  to  say  that 
a  fundamental  look  at  this  problem  has  yet  to  be 
taken.  Our  group  has  begun  to  do  this,  with  an  ini¬ 
tial  analysis  that  indicates  that  there  is  significant 
discriminating  information  to  be  obtained  if  one  ex¬ 
amines  sets  of  SAR  images  of  a  scene  constructed 
using  different  subapertures  of  the  full  SAR
aperture.  What  this  suggests  is  that  the  natural  data
structure  for  higher-level  ATR  functions  will  again 
involve  imagery  at  different  resolutions,  but  that  in 
this  case  the  focus  is  on  exploiting  differences  in  im¬ 
agery  obtained  from  different  parts  of  the  aperture 
(and  hence  from  different  viewing  angles).  Note  that 
this  results  in  a  very  novel  pose  estimation  problem: 
the  model  of  a  specular  reflector  must  capture  the 
fact  that  changes  in  pose  not  only  change  the  rel¬ 
ative  geometry  of  features  but  also  can  change  the 
appearance  of  these  features  in  imagery  from  differ¬ 
ent  apertures.  As  a  result,  the  action  of  the  pose 
transformation  group  on  a  set  of  images  from  dif¬ 
ferent  apertures  and  at  different  resolutions  is  not 
simply  a  geometric  transformation. 
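
The subaperture idea in item 2 can be illustrated with a toy example. The sketch assumes an idealized setting in which the image is the 2-D inverse FFT of a rectangular phase-history array, and it models a specular scatterer simply as one whose return is present over only part of the aperture. Both are simplifying assumptions made here, and the function names are hypothetical.

```python
import numpy as np


def point_phase_history(n_pulses, n_freqs, x_bin, y_bin, aspect_gain):
    """Phase history of one point scatterer with per-pulse (aspect) gain."""
    m = np.arange(n_pulses)
    n = np.arange(n_freqs)
    cross_range = aspect_gain * np.exp(-2j * np.pi * x_bin * m / n_pulses)
    down_range = np.exp(-2j * np.pi * y_bin * n / n_freqs)
    return np.outer(cross_range, down_range)


def subaperture_images(phase_history, n_sub):
    """Split the pulses into contiguous subapertures and image each one.

    Each subaperture image has coarser cross-range resolution but views the
    scene from a narrower range of angles.
    """
    blocks = np.array_split(phase_history, n_sub, axis=0)
    return [np.abs(np.fft.ifft2(block)) for block in blocks]


n_pulses, n_freqs = 64, 32
isotropic = point_phase_history(n_pulses, n_freqs, 8, 5, np.ones(n_pulses))
flash_gain = np.zeros(n_pulses)
flash_gain[16:32] = 1.0  # return present only over one quarter of the aperture
specular = point_phase_history(n_pulses, n_freqs, 8, 5, flash_gain)

iso_peaks = [img.max() for img in subaperture_images(isotropic, 4)]
spec_peaks = [img.max() for img in subaperture_images(specular, 4)]
print("isotropic peaks:", np.round(iso_peaks, 3))
print("specular peaks: ", np.round(spec_peaks, 3))
```

The sequence of peak values across subapertures is the kind of aspect-dependent signature that a single full-aperture image averages away.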

These  two  contexts  will  serve  as  focal  points  for  our 
work  in  the  near  future.  This  work,  we  expect,  will 
build  on  previous  successes  we  have  had  in  leveraging 
a  cross-disciplinary,  statistically  sound  approach  to 
ATR  analysis  and  development.  Examples  of  our 
work  in  this  arena  are  detailed  below. 

2  Specific  Research  Issues

2.1  Multiresolution  methods  in  laser 
range  imaging 

This  project  addresses  laser  radar  range  imaging  us¬ 
ing  realistic  models  for  the  uncertainty  in  the  mea¬ 
surements  provided  by  the  sensor.  Special  attention 
is  paid  to  the  so-called  range  anomalies,  which  are 
due  to  the  nonlinear  combined  effect  of  laser  speckle  and
receiver  noise.  We  use  a  Haar-wavelet,  expectation-
maximization  (EM)  algorithm  framework  to  produce
a  maximum-likelihood  (ML)  multiresolution  approx¬
imation  of  the  range  truth.  This  allows  us  to  trade  off
between  the  resolution  of  the  reconstruction  and  the 
distortions  caused  by  anomalies.  That  is,  as  we  at¬ 
tempt  to  resolve  finer  and  finer  scale  detail  we  are 
forced  to  use  a  greater  percentage  of  the  measured 
pixels  and  therefore  are  constrained  in  the  number 
that  can  be  declared  to  be  anomalous.  Using  sensor 
statistics  we  can  then  determine  the  correct  balance 
between  resolution  and  anomaly  rejection. 

By  exploiting  the  block-nature  of  the  Haar-wavelet 
estimation  procedure,  the  maximum-likelihood 
(ML)  fitting,  via  the  expectation-maximization 
(EM)  algorithm,  of  P  Haar-wavelet  blocks  to  a  Q- 
pixel  image  can  be  accomplished  for  practical  image 
sizes  with  reasonable  computational  complexity  and 
excellent  numerical  robustness.  This  EM/ML  pro¬ 
cessor  has  proven  to  be  nearly  optimal  -  its  estima¬ 
tion  performance  approaches  the  ultimate  limit  set 
by  the  complete-data  form  of  the  Cramer-Rao  bound
-  with  quantifiable  performance  characteristics. 
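
A much-reduced version of the anomaly-tolerant fitting described above can be written down for piecewise-constant blocks. The sketch assumes a mixture measurement model (Gaussian error around the true range with probability 1 - p, uniform over the range window with probability p) and fits each block independently with EM. It is not the Haar-wavelet EM/ML processor itself; the model parameters and function names are assumptions made for illustration.

```python
import numpy as np


def em_range_estimate(ranges, sigma, p_anomaly, max_range, n_iter=50):
    """EM estimate of one constant range from measurements with anomalies.

    Returns the range estimate and the per-pixel probability of being a
    non-anomalous measurement.
    """
    r = np.asarray(ranges, dtype=float)
    mu = float(np.median(r))  # robust starting point
    w = np.ones_like(r)
    for _ in range(n_iter):
        # E-step: posterior probability that each pixel is not anomalous.
        gauss = np.exp(-0.5 * ((r - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        inlier = (1.0 - p_anomaly) * gauss
        w = inlier / (inlier + p_anomaly / max_range)
        if w.sum() == 0.0:
            break
        # M-step: weighted mean of the measurements.
        mu = float(np.sum(w * r) / np.sum(w))
    return mu, w


def blockwise_estimate(ranges, n_blocks, **model):
    """Fit one constant per block; more blocks means finer resolution."""
    blocks = np.array_split(np.asarray(ranges, dtype=float), n_blocks)
    return np.array([em_range_estimate(b, **model)[0] for b in blocks])


pixels = [5.0, 5.1, 4.9, 5.0, 20.0, 20.1, 19.9, 80.0]
model = dict(sigma=0.1, p_anomaly=0.1, max_range=100.0)
print(blockwise_estimate(pixels, 2, **model))
```

Increasing the number of blocks leaves fewer pixels per block, so fewer of them can be down-weighted as anomalous before the estimate becomes unreliable, which is the resolution versus anomaly-rejection tradeoff noted in the text.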


The  fast  EM/ML  algorithm  has  been  made  avail¬ 
able  for  general  use.  A  web  page  has  been  mounted 
(http://cis.wustl.edu/mii-cis/laserradar/fmlem/fmlmalg.html)
which  provides:  the  software,  a  user
manual  [9],  a  full  technical  paper  [10],  and  a  sample 
input  file  plus  processed  imagery.  Our  current  re¬ 
search  in  laser  radar  range  imaging  is  aimed  at  wed¬ 
ding  the  fast  EM/ML  algorithm  to  a  model-based 
object  recognizer  that  relies  on  an  EM-based  align¬ 
ment  procedure  [25].  This  effort  also  leverages  off 
the  availability  of  large  amounts  of  data  from  the 
DARPA/MIT  Lincoln  Laboratory  Infrared  Airborne 
Radar  Data  Release  [2]. 

2.2  Multiresolution  models  for  aspect- 
dependent  SAR  imaging 

This  relatively  new  effort  draws  its  motivation  from 
earlier  work  by  members  of  our  group  and  collab¬ 
orators  at  Lincoln  Labs  and  Alphatech,  which  has 
used  Lincoln  Laboratory  SAR  data  to  demonstrate 
the  efficacy  of  multiresolution  discriminants  for  dis¬ 
tinguishing  targets  from  clutter  as  well  as  the  ben¬ 
efits  of  exploiting  the  aspect-dependence  (broadside 
flash)  seen  in  foliage  penetrating  SAR  imagery.  Un¬ 
like  these  efforts,  however,  the  new  work  builds  from 
fundamental  phenomenology  originally  established 
for  the  study  of  optical  SAR  [22].  Initial  results  [17] 
appear  to  be  quite  promising,  as  highlighted  below. 

Our  work  has  compared  the  carrier-to-noise  ratio 
(CNR)  signatures  produced  in  an  idealized  1-D, 
continuous-wave,  SAR  by  isolated  specular  and  dif¬ 
fuse  reflectors  of  the  same  size.  Whereas  both  re¬ 
flectors  produce  the  same  average  intensity  image, 
in  stripmap  operation,  when  processed  to  full  reso¬ 
lution,  such  is  not  the  case  for  multiresolution  pro¬ 
cessing.  The  coherence  of  the  specular  reflector  im¬ 
parts  a  higher  resolution  capability  -  in  the  inter¬ 
mediate  processing  regime  wherein  the  along-track 
chirp  compression  filter  has  a  processing  time  that 
is  long  enough  to  produce  an  effective  synthetic 
aperture,  but  short  enough  that  the  maximum,  i.e., 
transmitter-dwell-limited,  synthetic  aperture  reso¬ 
lution  has  not  been  realized  -  which  may  provide 
a  powerful  discriminant  for  separating  man-made 
(specular)  returns  from  natural  terrain  (diffuse)  re¬ 
turns.  This  initial  model  also  provides  a  theoretical 
basis  for  the  aspect-dependence  (broadside  flash)  ef¬ 
fects  expected  from  a  specular  target’s  tilt  with  re¬ 
spect  to  the  radar-to-ground  axis. 

We  have  refined  and  extended  the  preceding  CNR 
signatures  in  several  important  ways.  First,  by 
positing  the  idealized  hypothesis  testing  problem  of 
detecting  a  single  specular  return  in  the  midst  of 
featureless  diffuse  clutter,  we  have  verified,  in  this
simple  case,  the  essential  premise  stated  in  the  pre¬
vious  paragraph.  In  particular,  we  have  shown  that
optimum  processing  does  not  employ  a  maximum- 
resolution  imager.  Indeed,  under  exactly  the  cir¬ 
cumstances  that  lead  to  a  significant  difference  in 
the  CNR  signatures  of  specular  and  diffuse  objects, 
we  find  that  the  optimum  detection  processor  has  a 
substantially  better  detection  probability  -  for  the 
same  false-alarm  probability  -  than  the  full  reso¬ 
lution  processor.  In  general,  the  optimum  receiver 
for  this  detection  problem  uses  a  whitening  filter  (to 
flatten  the  spectrum  of  the  clutter-plus-noise  enter¬ 
ing  the  receiver)  followed  by  a  matched  filter  for  the 
whitening-filtered  specular  return.  We  have  shown, 
however,  that  under  interesting  operating  conditions 
this  optimum  receiver  is  well  approximated  by  a  sim¬ 
pler  multiresolution-filter  receiver.  The  1-D  SAR 
analysis  appears  in  [18]. 
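
The whitening-filter-plus-matched-filter structure of the optimum receiver can be checked numerically in a discrete setting. The sketch below assumes a known signal vector s observed in zero-mean Gaussian clutter-plus-noise with known covariance C, and shows that whitening followed by a matched filter reproduces the statistic s^H C^{-1} y. The covariance used is an arbitrary illustrative choice, not one derived from the 1-D SAR analysis.

```python
import numpy as np


def direct_statistic(y, s, C):
    """Detection statistic s^H C^{-1} y computed directly."""
    return np.vdot(s, np.linalg.solve(C, y))


def whiten_then_match(y, s, C):
    """Same statistic via a whitening filter followed by a matched filter."""
    L = np.linalg.cholesky(C)        # C = L L^H
    y_white = np.linalg.solve(L, y)  # whitened data
    s_white = np.linalg.solve(L, s)  # whitening-filtered signal
    return np.vdot(s_white, y_white)


rng = np.random.default_rng(1)
n = 32
idx = np.arange(n)
# Illustrative covariance: correlated clutter plus white receiver noise.
C = 0.9 ** np.abs(idx[:, None] - idx[None, :]) + 0.1 * np.eye(n)
s = np.exp(2j * np.pi * 0.2 * idx)
y = s + np.linalg.cholesky(C) @ rng.standard_normal(n)

print(direct_statistic(y, s, C), whiten_then_match(y, s, C))
```

The two values agree to numerical precision, since s^H C^{-1} y = (L^{-1} s)^H (L^{-1} y).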

Our  longer-term  plans  include:  hypothesis  test¬ 
ing  for  a  composite  target  -  comprised  of  multiple 
target-reflection  primitives,  i.e.,  speculars,  dihedrals, 
trihedrals  —  when  it  is  embedded  in  diffuse  clutter; 
CNR  signatures  for  stripmap,  spotlight-mode,  and 
polarimetric  operation  of  a  2-D  SAR.  We  have  al¬ 
ready  completed  full  polarimetric  generalization  of 
our  specular,  dihedral,  and  trihedral  target  models. 

2.3  High-Resolution  Pursuit  for  Robust 
Feature  Extraction 

Members  of  our  group  have  attacked  the  problem  of 
robust  multiresolution  feature  extraction  through  a 
technique  which  we  refer  to  as  high-resolution  pur¬ 
suit.  This  work,  done  in  collaboration  with  Prof. 
Stephane  Mallat  of  Courant  Institute,  is  a  variation 
on  Mallat’s  matching  pursuit  algorithm  involving  a 
new  criterion  that  trades  off  between  global  fit  and 
local  fit  in  choosing  each  of  a  succession  of  features. 
This  involves  only  a  modest  increase  in  complexity 
over  matching  pursuit  but  leads  to  features  which 
are  much  more  clearly  connected  to  physical  fea¬ 
tures  in  data  and  which  appear  to  have  significant 
robustness  to  several  types  of  noise,  including  ad¬ 
ditive  noise  as  well  as  “spiky”  noise,  as  one  would 
expect  in  speckle-corrupted  radar  data.  The  initial 
application  of  this  method  has  been  to  the  prob¬ 
lem  of  recognition  of  objects  from  their  silhouettes. 
Specifically,  there  is  a  natural  way  in  which  to  map 
a  silhouette  into  a  1-D  signal  corresponding  to  the 
outline  of  the  object,  and  a  number  of  recognition 
methods  have  been  developed  that  are  based  on  an¬ 
alyzing  these  1-D  silhouette  functions.  These  methods
all  have  some  deficiencies,  notably  a  lack  of  robustness  to
noise  or  to  the  absence  of  features  or  the  introduc¬ 
tion  of  spurious  features  due  to  noise,  occlusion,  or 


anomalous  estimation  of  object  silhouette  as  can  oc¬ 
cur  in  practice.  The  vehicle  that  we  are  using  to 
test  our  approach  is  the  recognition  of  silhouettes  of 
a  number  of  different  aircraft,  a  problem  that  has 
been  considered  by  others  and  which  thus  provides 
us  with  a  fair  way  in  which  to  compare  our  meth¬ 
ods.  The  results  of  extensive  testing  demonstrate
the  superiority  of  our  new  approach,  including  an  in¬ 
creased  level  of  robustness  to  the  types  of  silhouette 
errors  observed  in  practice. 
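
For reference, the baseline matching pursuit procedure that high-resolution pursuit modifies can be sketched in a few lines. This is the standard greedy algorithm only, under the assumption of a dictionary with unit-norm columns; the high-resolution criterion that trades off global and local fit is not reproduced here, and the function name is illustrative.

```python
import numpy as np


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy decomposition of a signal over unit-norm dictionary columns.

    Returns a list of (atom_index, coefficient) pairs and the final residual.
    """
    residual = np.asarray(signal, dtype=float).copy()
    picks = []
    for _ in range(n_atoms):
        correlations = dictionary.T @ residual
        k = int(np.argmax(np.abs(correlations)))  # best globally matching atom
        coeff = float(correlations[k])
        picks.append((k, coeff))
        residual -= coeff * dictionary[:, k]
    return picks, residual


# Toy dictionary: unit impulses plus normalized two-sample "bumps".
impulses = np.eye(8)
bumps = (impulses[:, :-1] + impulses[:, 1:]) / np.sqrt(2.0)
dictionary = np.hstack([impulses, bumps])
signal = 2.0 * impulses[:, 3] - 1.0 * impulses[:, 6]
picks, residual = matching_pursuit(signal, dictionary, n_atoms=2)
print(picks, np.linalg.norm(residual))
```

Because each step selects the atom with the largest inner product against the whole residual, the choice is driven by global fit alone, which is the behavior the modified criterion is meant to temper.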

This  technique  has  been  transitioned  to  Alphatech, 
Inc.,  where  high-resolution  pursuit  is  being  applied 
to  the  problem  of  robust  compression  of  SAR  tar¬ 
get  models  for  template-based  ATR  and  to  prob¬ 
lems  of  multisensor  fusion.  In  addition,  members  of 
our  group  have  recently  developed  a  robust  wavelet- 
based  technique  for  optimal  adaptive  wavelet  repre¬ 
sentation  for  noisy  data  and  have  begun  an  exam¬ 
ination  of  the  applicability  of  this  method  to  high- 
resolution  radar  feature  extraction. 

2.4  Multiresolution  Analysis  of  SAR 

In  earlier  work,  our  group  in  conjunction  with  Dr. 
William  Irving  (Alphatech),  and  Dr.  Les  Novak  (Lin¬ 
coln  Labs)  had  considerable  success  in  constructing 
and  using  multiresolution  models  for  SAR  imagery 
as  the  basis  for  enhancing  Lincoln’s  discrimination 
algorithm,  resulting  in  a  reduction  in  false  alarm 
rates by a factor of 6 over Lincoln’s baseline algorithm.
Following this, our group has extended these
multiresolution  SAR  likelihood  ratio  methods  to  the 
problem  of  distinguishing  different  types  of  natural 
terrain  (trees,  grass,  etc.).  The  method  we  have  de¬ 
veloped  has  been  shown  to  provide  highly  reliable 
classification  decisions  and  accurate  estimates  of  ter¬ 
rain  boundaries.  In  addition,  our  group  has  refined 
and  extended  these  methods  in  order  to  develop  ef¬ 
ficient  SAR  compression  algorithms  that  take  ad¬ 
vantage  of  the  speckle-decorrelating  property  of  our 
multiresolution  models.  These  methods  have  been 
transitioned  both  to  Lincoln  and  recently  to  Alphat¬ 
ech,  Inc.  In  addition  to  these  efforts,  we  have  also 
recently  initiated  another  new  project  aimed  at  ex¬ 
ploiting  the  properties  of  our  multiresolution  SAR 
models  even  further.  Specifically,  we  have  begun  to 
look  at  the  use  of  multiresolution  SAR  models  for 
optimal or near-optimal model-based ATR. Because
our models do a surprisingly good job of
whitening speckle, they can provide the basis for
optimal extraction of features corresponding to
statistically significant deviations from this white
behavior, which can be directly related to the presence of
one  or  several  dominant  scatterers.  In  our  work  we 
expect  to  investigate  both  enhanced  template-based 
methods as well as fully model-based methods, for
which  the  problems  of  pose  estimation,  search,  and 
match  take  on  new  twists,  thanks  to  the  multires¬ 
olution  nature  of  our  features.  In  addition,  as  an 
alternative  direction  we  are  also  exploring  the  use  of 
multiresolution  models  to  capture  aspect-dependent 
effects  of  significant  scatterers.  The  key  idea  here  is 
that  the  effect  of  pose  in  such  a  situation  becomes 
more complex, since pose can affect not only the
location but the appearance of aspect-dependent
scatterers.

2.5  Estimation-Theoretic  SAR  Image 
Formation  for  Moving  Scenes 

Our  group,  in  collaboration  with  Alphatech,  has  re¬ 
cently  developed  the  foundation  of  an  estimation- 
theoretic  approach  to  optimal  SAR  image  formation 
when  there  is  motion  in  the  scene  being  imaged. 
The  starting  point  for  this  work  involves  viewing  the 
problem as one of joint position-velocity imaging,
i.e., the SAR equivalent of range-Doppler imaging
but  now  in  4  dimensions  (2  space  and  2  spatial  ve¬ 
locity).  We  have  developed  a  novel  and  very  effi¬ 
cient  method  for  calculating  what  can  alternatively 
be  thought  of  as  the  likelihood  function  for  the  lo¬ 
cation  in  range-rate/cross  range  of  a  scatterer  or  as 
an image in this 2-D space for each range bin.
Using this likelihood function, we have now developed a
first  approach  to  SAR  imaging  that  does  not  require 
motion  to  be  rigid  body.  Specifically,  one  of  the  ob¬ 
jectives  of  this  work  is  to  develop  a  method  that  can 
focus  an  entire  SAR  image  even  if  different  scatterers 
in  the  scene  are  moving  differently  (including  hav¬ 
ing  a  target  moving  across  a  stationary  background). 
We  have  recently  obtained  promising  results  on  sim¬ 
ulated  data  demonstrating  that  the  method  can  in¬ 
deed  focus  images  with  non-rigid  motion.  We  are 
also  in  the  process  of  characterizing  the  blur  in  the 
higher-dimensional  imaging  framework  in  order  to 
develop enhanced methods when there are interfering
scatterers (e.g. due to background clutter and
moving  targets).  One  challenge  here  is  the  develop¬ 
ment  of  a  test  environment  that  is  both  realistic  and 
that  allows  in  essence  Monte  Carlo  testing  in  order 
to  obtain  statistically  significant  figures  of  merit  for 
our  approach.  The  idea  here  is  not  to  use  X-Patch 
or  other  simulators  but  instead  to  take  real  SAR 
data  and  selectively  modify  phase  histories  in  order 
to  mimic  the  effects  of  motion. 
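
The idea of modifying a phase history to mimic motion can be sketched for a single range bin. The example below is a simplified illustration, not the test environment described above: it assumes one ideal point scatterer, a plain DFT for cross-range compression, and two hypothetical parameters (`doppler_bins`, `quad_cycles`). It shows the standard effects: a constant range rate adds a linear phase and displaces the scatterer in cross range, while a range acceleration adds a quadratic phase and defocuses it.

```python
import cmath
import math

def azimuth_compress(history):
    """Magnitude of the DFT of a pulse-by-pulse phase history (one range bin)."""
    n = len(history)
    image = []
    for k in range(n):
        acc = sum(history[p] * cmath.exp(-2j * math.pi * k * p / n)
                  for p in range(n))
        image.append(abs(acc) / n)
    return image

def add_motion(history, doppler_bins=0.0, quad_cycles=0.0):
    """Multiply a phase history by the phase of a hypothetical moving scatterer.

    doppler_bins: linear phase, expressed as a shift in cross-range bins
                  (constant range rate).
    quad_cycles:  quadratic phase reached at the aperture edges, in cycles
                  (range acceleration).
    """
    n = len(history)
    out = []
    for p, h in enumerate(history):
        u = (p - n / 2) / (n / 2)  # normalized slow time in [-1, 1)
        phase = 2 * math.pi * (doppler_bins * p / n + quad_cycles * u * u)
        out.append(h * cmath.exp(1j * phase))
    return out

n = 64
stationary = [cmath.exp(2j * math.pi * 10 * p / n) for p in range(n)]
cases = [
    ("stationary", stationary),
    ("constant range rate", add_motion(stationary, doppler_bins=5)),
    ("range acceleration", add_motion(stationary, quad_cycles=1.0)),
]
for name, hist in cases:
    img = azimuth_compress(hist)
    peak = max(range(n), key=lambda k: img[k])
    print(name, "peak bin", peak, "peak height", round(img[peak], 3))
```

Applying such phase factors to selected scatterers in real phase-history data, rather than to a synthetic point target, is what would allow repeated randomized trials on realistic clutter.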

2.6  Segmentation  and  Feature  Extrac¬ 
tion  of  Speckle-Corrupted  Imagery 

Our  group  has  also  had  a  significant  success  in  a 
new effort, aimed at developing robust image
segmentation algorithms. Our approach is based on
a  detailed  examination  of  the  very  active  field  of 
nonlinear  diffusions  and  curve  evolution  in  image 
processing,  from  which  we  developed  a  very  simple 
variant  that  not  only  leads  to  vastly  reduced  com¬ 
putational  burdens  but  also  provides  explicit  seg¬ 
mentations  at  a  hierarchy  of  resolutions  (rather  than 
requiring  substantial  postprocessing  and  interpreta¬ 
tion).  This  new  algorithm  is  characterized  by  a  set  of 
coupled  differential  equations  for  the  transformation 
of  image  pixel  values  (analogous  to  the  linear  differ¬ 
ential  equations  that  lead  to  the  scale-space  concepts 
developed  by  Witkin  and  others),  where  in  our  case 
the differential equations have a significant
discontinuity that leads both to robust edge identification
and  to  noise  removal,  including  the  removal  of  occa¬ 
sional  high-amplitude  noise  spikes.  We  have  recently 
demonstrated  that  this  algorithm  can  perform  sur¬ 
prisingly  accurate  segmentations  for  the  very  high 
speckle  levels  present  in  single-polarization  SAR  im¬ 
agery.  Given  this  success,  we  are  currently  investi¬ 
gating  the  extension  and  application  of  this  method¬ 
ology  to  problems  such  as  robust  feature  extraction. 

2.7  Mutual  Information  Based  Regis¬ 
tration  for  Recognition  and  Fusion 

Our  group  has  developed  a  new  approach  for  finding 
the  pose  of  an  object  model  in  an  image.  In  earlier 
work,  we  examined  methods  that  utilized  the  Ex¬ 
pectation/Maximization  algorithm  to  trade  off  solv¬ 
ing  for  the  correspondence  between  model  and  data 
features,  and  solving  for  the  pose  of  the  target  in 
the  data  coordinate  frame.  A  multiresolution  ver¬ 
sion  of  the  method  was  designed  and  implemented  to 
demonstrate  the  potential  efficiencies  and  robustness 
gained  by  using  multiresolution  data  and  models.  A 
drawback  to  this  approach  is  a  reliance  on  explicit 
uncertainty  models.  As  an  alternative,  we  have  de¬ 
veloped  a  new  approach  for  finding  the  pose  of  an 
object  model  in  an  image,  based  on  a  new  formula¬ 
tion  of  the  mutual  information  between  model  and 
image.  As  applied  here  the  technique  is  intensity- 
based,  rather  than  feature-based.  It  works  well  in 
domains  where  edge  or  gradient-magnitude  based 
methods  have  difficulty,  yet  it  is  more  robust  than 
traditional  correlation. 

Our  approach  to  aligning  the  object  model  to  the 
image  is  based  on  the  following  steps:  (1)  The  mu¬ 
tual  information  of  the  model  and  image  is  defined, 
and  is  expressed  in  terms  of  the  entropies  of  sev¬ 
eral  random  variables.  (2)  The  entropies  and  their 
derivatives  are  approximated  by  a  method  that  in¬ 
volves  random  sampling  from  the  model  and  image 
data, or by using histogramming methods. (3) A
local maximum of the mutual information is sought by
using a stochastic analog of gradient descent. Steps
are repeatedly taken that are proportional to the
approximation of the derivative of the mutual
information with respect to the transformation.

Figure 2: Using Mutual Information to register a
SAR image and a visible light image of the same
location (from the Stockbridge data set). Left: visible
light image. Right: SAR image of same location.
Center: the visible light image rotated and
translated so that it is aligned with the SAR image.

We  have  demonstrated  the  applicability  of  this 
method  to  tracking  moving  3D  objects  in  video  se¬ 
quences,  to  registering  SAR  with  video  or  other  data 
sources,  and  to  registering  target  models  with  ex¬ 
tracted  SAR  data  (see  Figure  2). 
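
The following sketch illustrates intensity-based alignment by mutual information under strong simplifications. It is not the method described above: 1-D signals stand in for images, the entropies are estimated from a joint histogram, and an exhaustive search over integer shifts replaces the stochastic gradient step. All names and parameters are hypothetical. The "image" is a contrast-reversing, nonlinear function of the "model", a case where plain correlation is a poor match score.

```python
import math
import random
from collections import Counter

def mutual_information(a, b, bins=8, lo=0.0, hi=1.0):
    """Histogram estimate of I(A;B) in nats for paired samples a[i], b[i]."""
    def quantize(v):
        k = int((v - lo) / (hi - lo) * bins)
        return min(max(k, 0), bins - 1)
    pairs = [(quantize(x), quantize(y)) for x, y in zip(a, b)]
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(i for i, _ in pairs)
    count_b = Counter(j for _, j in pairs)
    # I(A;B) = sum p(i,j) * log( p(i,j) / (p(i) p(j)) )
    return sum((c / n) * math.log(c * n / (count_a[i] * count_b[j]))
               for (i, j), c in joint.items())

def best_shift(model, image, max_shift):
    """Return (shift, score) maximizing MI between model[i] and image[i + shift]."""
    best = None
    for s in range(-max_shift, max_shift + 1):
        xs, ys = [], []
        for i, m in enumerate(model):
            if 0 <= i + s < len(image):
                xs.append(m)
                ys.append(image[i + s])
        score = mutual_information(xs, ys)
        if best is None or score > best[1]:
            best = (s, score)
    return best

rng = random.Random(0)
model = [rng.random() for _ in range(200)]
image = [rng.random() for _ in range(210)]
for i, m in enumerate(model):
    image[i + 5] = 1.0 - m * m  # shifted by 5 samples, contrast reversed
shift, score = best_shift(model, image, max_shift=8)
print(shift, round(score, 3))
```

For real images the search space includes rotation and scale, which is why a gradient-based search on a sampled estimate of the mutual information is attractive compared with exhaustive search.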
1. Motivation


The  first  topic  of  the  research  deals  with  the 
problem  of  automatically  extracting  features 
from  imagery.  The  selection  of  features  has  been 
primarily a heuristic procedure. Humans pick
features that seem to be relevant to solve the
problem based on their experience with the data.
This procedure has two shortcomings: first, one
does not know a priori the discriminant power of
the features; second, when the signals (or
images) are collected with new sensors or the
imagery is new, there is no available knowledge,
so there is a lag in the exploitation of the
imagery. We believe that this may be happening in
synthetic  aperture  radar  (SAR)  imagery  due  to 
the  novelty  of  the  technique  and  the  principles 
involved  in  image  formation,  which  are  different 
from  optical  images  (Figure  1). 


Where  is  the  water  tower? 


Our approach seeks to develop a new methodology
to automatically extract features from the
imagery using information-theoretic principles.


Acknowledgment: This work was partially supported by
Our ultimate goal is to perform feature extraction
through  optimization  of  the  information  transfer 
from  the  original  image  to  a  subspace.  If  we  are 
successful,  a  general  and  systematic  procedure 
will  be  available  to  extract  features  from  any 
type  of  imagery.  Since  there  is  no  desired 
response  at  the  feature  extraction  stage,  the 
method  has  to  be  self-organizing. 

The  second  aspect  of  the  research  is  to  enhance 
our  present  methodology  of  training  classifiers. 
So  far  our  classifiers  are  trained  in  the  laboratory 
and  their  parameters  set  “for  ever”.  There  is 
strong  evidence  that  the  performance  of  any 
machine  that  learns  from  the  environment  (i.e. 
through  induction)  is  ultimately  limited  by  the 
amount  of  data  that  it  is  trained  with.  This  means 
that  machines  should  always  be  learning  when 
exposed  to  new  data,  even  after  being  deployed. 
There is little knowledge of how to do this in a
systematic and robust way. So a second goal of this
research  is  to  seek  methods  to  continue  training 
classifiers  after  deployment. 

2.  Research  Questions 

a) Statistically independent features

Formulate  feature  extraction  as  a  process  of 
transferring  the  most  information  about  the  input 
signal  through  a  special  information  channel 
which  includes  a  projection  to  a  subspace. 

•  Develop  an  algorithmic  methodology  to 
implement  the  feature  extraction. 
•  Characterize the methodology and compare it
to  independent  component  analysis  and  prin¬ 
cipal  component  analysis. 

•  Validate  the  method  with  SAR  data  and  com¬ 
pare  it  with  other  existing  methods  of  feature 
extraction. 

b) Training after deployment

•  Seek  a  methodology  to  keep  training  classifi¬ 
ers  after  deployment  by  using  ideas  from  rein¬ 
forcement  learning. 

•  Study  alternate  methodologies  based  on  the 
EM  algorithm  and  the  newly  introduced  mix¬ 
ture  of  experts  model. 

3.  Relevance 

This  work  is  relevant  because  it  is  addressing 
fundamental  questions  in  pattern  recognition.  It 
will  also  impact  the  present  state-of-the-art  in 
automatic  target  recognition  (ATR)  using  SAR. 
SAR is a new type of imagery that is very different from
optical  images.  What  are  the  most  discriminant 
features  in  SAR?  How  shall  we  extract  them? 
What  is  the  information  content  of  SAR  versus 
optical imagery? These are very important
questions that need to be answered using a principled
approach such as the one to be developed here.

The  second  aspect  is  also  very  important  for  ATR 
since  these  systems  are  a  mixture  of  model  based 
approaches  with  data  driven  methods  (see  the 
MSTAR  program).  We  firmly  believe  that  sooner 
or  later  the  performance  limit  will  be  the  lack  of 
data  used  to  train  these  systems.  Data  collection 
is  extremely  time  consuming  and  expensive. 
However,  ATR  systems  are  exposed  to  lots  of 
data  when  they  are  deployed.  The  issue  is  how  to 
harness  this  exposure  to  new  data  and  keep  train¬ 
ing  the  ATR  system. 

4.  Methodologies 

In  the  first  topic  (feature  extraction)  we  plan  to 
use  information  theory  and  artificial  neural  net¬ 
works  to  create  nonlinear  subspace  projections 
that  preserve  as  much  as  possible  the  input  image 
information  after  the  projection  into  the  feature 
space. We have already developed a methodology
to manipulate the information in the output.
We still need robust training algorithms to
perform the task. The characterization of the quality
of  the  method  and  comparison  with  alternate 
methodologies  is  also  necessary. 

In  the  second  topic  (training  after  deployment), 
we  will  formulate  the  problem  using  reinforce¬ 
ment  learning  algorithms  to  include  the  scalar 
input  from  the  operator  (Yes/No).  We  will  also 
study other self-organizing alternatives such as
the EM (expectation-maximization) algorithm and
the recently introduced mixture of experts (MOE)
model (see our paper in the proceedings).

Please  see  our  WEB  page  for  further  details 
(http://www.cnel.ufl.edu/). 

5.  Evaluation 

We  will  start  with  some  easy  to  interpret  “syn¬ 
thetic”  cases  to  obtain  a  better  understanding  of 
the  algorithms,  but  the  bulk  of  the  evaluation  will 
utilize  the  newly  released  SAR  data  for  the 
MSTAR  program.  Our  goal  is  to  compare  the 
performance  of  our  feature  extractor  with  more 
traditional  features  such  as  the  gamma  intensity 
features  [1],  [2].  We  propose  to  use  the  same 
classifier  (such  as  the  nonlinear  MACE  filter  [3] 
and the MSTAR classifier stage) and substitute
our features for the traditionally utilized ones for
performance comparisons. The comparison will
follow  the  well  established  procedure  of  ROC 
(receiver  operating  characteristics)  curves.  We 
expect  to  provide  technology  transfer  to  MSTAR 
contractors  developing  focus  of  attention  and 
segmentation  algorithms.  They  will  be  the  ulti¬ 
mate  judges  of  the  quality  of  our  work.  Part  of 
our  technology  transfer  is  also  to  develop  mod¬ 
ules  for  the  image  understanding  environment. 
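
As a small illustration of the ROC comparison procedure, the sketch below computes an ROC curve and its area from classifier scores. The scores and labels are made-up values, not MSTAR results, and the function names are hypothetical.

```python
def roc_curve(scores, labels):
    """Return ROC points (false positive rate, true positive rate).

    scores: higher means "more target-like"; labels: 1 = target, 0 = clutter.
    One point is produced per distinct threshold, sweeping from strict to lax.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_curve(scores, labels)
print(curve)
print(area_under_curve(curve))
```

Running the same classifier with two different feature sets and overlaying the two curves gives the comparison described above; the area under each curve is one scalar summary.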
Abstract

This  analysis  of  XPatch  synthesized  SAR  images 
shows  promising  results  on  existence  of  persistent 
scatterers  and  their  measurement  in  XPatch  im¬ 
agery.  Preliminary  results  based  on  persistent  scat¬ 
terers  and  a  generic  vehicle  model  demonstrate  es¬ 
timation  of  vehicle  orientation  to  about  3  degrees, 
length between wheels to about 30 cm, and estimation
of the number of wheels, without knowledge of vehicle
class or pose. Planned research will investigate
recognition  using  generic  target  and  clutter  models 
in  analysis  of  a  stable,  useful  subset  of  radar  images. 
Research  will  include  quantitative  characterization 
of  persistent  scatterers  and  stable  relations,  an  end- 
to-end  recognition  test,  and  investigation  of  support¬ 
ing  technology  for  MSTAR.  The  paradigm  offers  a 
genuine  methodology  to  cope  with  articulation  and 
obscuration. 


1  Introduction 

SAR  imagery  is  known  to  vary  rapidly  with  tar¬ 
get  angle.  Scatterers  appear  and  disappear  over  a 
few  degrees.  The  resultant  image  variability  makes 
ATR  template  matching  computationally  complex, 
requiring  search  for  match  over  a  relatively  dense 
set  of  angles.  That  image  variability  also  lessens  the 
discrimination  power  of  template  matching  in  ATR, 
because  data  images  cannot  be  guaranteed  to  match 
exactly.  Matching  errors  tend  to  introduce  biases  in 
matching;  matching  algorithms  must  relax  matching 
constraints.  Existing  systems  work  reasonably  for 
unarticulated,  unobstructed  targets,  e.g.  MSTAR 
and  [Ikeuchi  96;  Novak  94;  Novak  95;  Novak  96]. 


’This  research  was  supported  by  a  contract  from 
the  Air  Force,  F33615-93-1-1281  through  WPAFB  from 
ARPA  ASTO  “Multi-Sensor  ATR;  Quasi-Invariants  and 
High  Accuracy  Measurements  in  Bayesian  Inference” 
Imagery”.


However,  results  degrade  with  obscuration  and  ar¬ 
ticulation. 

Research  leading  to  this  project  was  intended  to  in¬ 
vestigate  whether  a  stable,  useful  subset  of  radar 
images  could  be  found,  i.e.  to  investigate  whether 
persistent  scatterers  exist,  enough  scatterers  suffi¬ 
ciently  persistent  that  indexing  could  be  done  ini¬ 
tially  with  persistent  scatterers,  to  characterize  their 
behavior  in  a  preliminary  way,  and  to  demonstrate 
that  scatterer  peaks  could  be  estimated  accurately 
to  represent  target  signals  in  MSTAR  or  in  an  in¬ 
tended  matching  algorithm. 

We  were  motivated  by  analysis  of  scattering  of  target 
components  to  ask  whether  stable  scatterers  exist. 
A  component  scattering  model  was  not  regarded  as 
plausible  in  1991.  In  turntable  experiments,  [Dud¬ 
geon  94]  found  evidence  for  persistent  scatterers,  a 
minimum  of  about  10  at  any  angle,  persistent  for 
a  minimum  of  20  degrees,  with  no  disappearance  of 
more  than  one  degree.  Subsequently,  we  found  per¬ 
sistent  scatterers  in  XPatch  data.  A  better  analytic 
understanding  was  since  derived  for  persistent  scat¬ 
terers  and  for  a  generic  component  scattering  model. 

An  analysis  based  on  partial  matching  of  persistent 
features  to  generic  vehicle  models  was  proposed  as  a 
solution to recognition with obscuration and
articulation. Partial matching with generic models is also
a methodology for hypothesis generation without
exhaustive enumeration of targets and poses.
Lessening variability by choosing persistent scatterers and
stable relations might compensate in part
for the unavoidable loss of discriminability
with obscuration and articulation.

The  objective  of  this  project  is  to  develop  target 
recognition  based  on  persistent  scatterers  and  stable 
relations  in  the  support  of  the  MSTAR  paradigm  for 
ATR  and  in  development  of  a  new  recognition  sys¬ 
tem.  The  following  subgoals  lie  along  the  path  to¬ 
ward  that  goal:  to  characterize  persistent  scatterers 
further; to characterize stable relations among
persistent scatterers; to investigate non-persistent
scatterers and incorporate matching them in a
matching scheme; to develop a generic image model and
generic target model; to derive and implement a
recognition algorithm in a Bayesian network
[Binford 87], based on generic image and target
models to estimate angle and dimensions of vehicles
without enumerating vehicle class and pose.

At  the  same  time,  context  is  widely  regarded  as 
an effective way to eliminate large classes of
clutter from fixed targets and civilian vehicles [Levitt
95].  This  research  will  investigate  context-enhanced 
discrimination  of  targets  vs  natural  and  cultural 
clutter  based  on  correspondence  between  context 
from available databases (e.g. DMIF or HUB)
and  extended  structures  registered  from  imagery 
[Oliver  90;  Posner  93;  Wellman  93].  Progress  in 
using  context  will  come  from  improved  discrimina¬ 
tion  using  structures  found  from  texture  analysis 
and  from  buildings  discriminated  from  analysis  of 
leading  edges  in  their  radar  image. 


2  Toward  an  Effective  Methodology 

Some  questions  need  to  be  answered  concerning  ef¬ 
fectiveness  of  this  methodology.  1)  Is  there  a  use¬ 
ful,  stable  subset  of  a  SAR  image,  enough  persis¬ 
tent  scatterers  to  constrain  target  hypotheses,  suf¬ 
ficiently  persistent  to  implement  a  hierarchical  in¬ 
dexing  scheme  proposed  in  this  component-based, 
body-centered  recognition  paradigm?  2)  Are  there 
stable relations among persistent scatterers that
enable partial matching that is computationally
affordable? Partial matching algorithms range from
computationally efficient to computationally
expensive. 3) Can features be extracted
from  persistent  scatterers  accurately  enough  over  a 
broad  range  of  target  orientations  with  enough  in¬ 
dependence  from  interference  among  scatterers  to 
enable  stable  estimation  of  scatterer  locations  and 
stable  estimation  of  relations  among  them  to  enable 
estimating vehicle orientation and dimensions
without knowing vehicle class or orientation? 4) Can a
generic vehicle model be formulated that gives a
basis for predicting automatically which stable
relations will exist and be useful in a broad range of
images?

We  show  preliminary  results  for  XPatch  data  for 
three  vehicles.  Using  synthesized  data  provides  data 
over  a  range  of  controlled  situations  essential  to  de¬ 
veloping  a  sound  algorithm  basis,  before  develop¬ 
ment  on  real  data.  Ikeuchi  at  CMU  provided  the 
XPatch  data.  In  Figure  1,  at  the  top  is  a  low  qual¬ 
ity  optical  image  of  a  BTR60.  Below  it  are  6  SAR 
images  from  XPatch  at  a  range  of  angles:  6,  12,  18 
degrees,  then  below,  30,  36,  42  degrees.  Note  that 
the SAR images are defocussed in the range
direction; presumably they could be focussed better by
a  choice  of  XPatch  parameters.  At  the  bottom  are 
intensity  profiles  along  the  leading  edge  of  the  target 
at  angles  of  31,  37,  and  43  degrees  and  69,  75,  and  81 
degrees.  We  see  four  peaks  with  nearly  constant  am¬ 
plitude  (to  15%).  Note  that  the  peaks  persist  from  6 
degrees  to  174  degrees,  nearly  the  whole  target  angle 
range.  The  BTR60  has  four  wheels  with  wheel  spac¬ 
ing  consistent  with  the  good  measurements  available 
from  peak  detection.  Wheel  spacings  provide  a  geo¬ 
metric  invariant  (or  several)  and  several  geometric 
quasi-invariants.  Peak  amplitudes  provide  photo¬ 
metric  quasi-invariants  independent  of  photometric 
calibration  of  the  image. 

Figure  2  shows  evidence  of  persistent  scatterers  over 
the  whole  vehicle.  In  the  upper  left  are  peak  scatter¬ 
ers  for  the  BTR60,  extracted  by  our  peak  detection 
algorithms  from  XPatch  images  over  angles  from  45 
to  135  degrees.  Results  are  similar  over  nearly  0  to 
180  degrees.  Peaks  extracted  from  XPatch  images  at 
1  degree  intervals  for  constant  slant  angle  are  rotated 
to  zero  degrees  azimuth  and  superimposed.  We  see 
four  clear  clusters  on  the  leading  edge  and  eight  or 
nine  more  clusters  interior  on  the  vehicle,  along  the 
top.  On  the  middle  right,  the  four  peaks  were  iso¬ 
lated  using  their  statistical  distribution.  In  the  lower 
left,  the  four  peaks  are  shown  in  a  3D  plot  intended 
to  demonstrate  in  an  intuitive  and  semi-quantitative 
way  the  level  of  persistence  of  the  scatterers.  The 
plot  shows  (x,y)  position  as  a  function  of  azimuth 
angle  at  angles  from  45  to  135  degrees.  If  we  look 
along  the  vehicle  leading  edge  (x),  we  see  that  the 
peaks  are  well-separated,  with  few  dropouts. 
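
The superposition used for Figure 2 can be sketched as follows. This is an illustration with synthetic data, assuming that peak coordinates are given relative to a known rotation center and that a peak counts as present if it falls within a chosen radius of a body-frame location; the wheel positions, dropout pattern, and function names are all hypothetical.

```python
import math

def rotate_to_zero_azimuth(peaks, azimuth_deg):
    """Rotate (x, y) peak locations by -azimuth about the origin."""
    a = math.radians(-azimuth_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y, s * x + c * y) for x, y in peaks]

def persistence(peaks_by_azimuth, body_points, radius):
    """Fraction of views in which each body-frame point has a nearby peak."""
    counts = [0] * len(body_points)
    for azimuth, peaks in peaks_by_azimuth.items():
        aligned = rotate_to_zero_azimuth(peaks, azimuth)
        for k, (bx, by) in enumerate(body_points):
            if any(math.hypot(px - bx, py - by) <= radius
                   for px, py in aligned):
                counts[k] += 1
    return [c / len(peaks_by_azimuth) for c in counts]

# Synthetic views at 1 degree intervals: four "wheels" on one edge, with the
# third wheel missing whenever the azimuth is a multiple of 10 degrees.
wheels = [(-3.0, -1.0), (-1.0, -1.0), (1.0, -1.0), (3.0, -1.0)]
views = {}
for az in range(45, 136):
    a = math.radians(az)
    c, s = math.cos(a), math.sin(a)
    views[az] = [(c * x - s * y, s * x + c * y)
                 for k, (x, y) in enumerate(wheels)
                 if not (k == 2 and az % 10 == 0)]
print(persistence(views, wheels, radius=0.25))
```

With real peak detections the body-frame locations are not known in advance; they would come from clustering the superimposed peaks.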

Existence of stable relations seems to be borne out by
measurements of ratios of wheel spacings over large
ranges  of  angles.  There  is  also  strong  analytic  rea¬ 
soning  in  support  of  geometric  invariants  and  quasi¬ 
invariants.  Subsequent  detailed  SAR  image  analysis 
may  strengthen  and  refine  this  expectation.  Much 
more  analysis  will  be  made  of  similar  data  using 
XPatch  backtrace  facilities  to  determine  where  the 
true  scatterers  are,  to  quantify  variability  of  individ¬ 
ual  scatterers  for  incorporation  into  estimation  and 
decision  algorithms,  and  to  serve  as  ground  truth 
for  peak  estimation  and  feature  detection.  Non- 
persistent  scatterers  and  the  ratio  of  persistent  to 
non-persistent  will  be  studied.  These  investigations 
will  be  carried  out  in  collaboration  with  Vince  Velten 
from  Wright-Patterson  AFB  and  Dr.  Eamon  Barret 
and  Dr.  Paul  Payton  from  Lockheed-Martin. 
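
One geometric reason to expect stable spacing ratios is that ratios of lengths along a line are preserved by any affine map. The sketch below checks this numerically. Treating the imaging of leading-edge scatterers as an affine map is an assumption made for illustration, and the wheel positions and function names are hypothetical.

```python
import math

def spacing_fractions(peaks):
    """Consecutive spacings of ordered collinear peaks, as fractions of the total."""
    gaps = [math.hypot(q[0] - p[0], q[1] - p[1])
            for p, q in zip(peaks, peaks[1:])]
    total = sum(gaps)
    return [g / total for g in gaps]

def rotate_and_foreshorten(points, azimuth_deg, range_scale):
    """Rotate points by the azimuth, then scale the y (range) axis."""
    a = math.radians(azimuth_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y, range_scale * (s * x + c * y)) for x, y in points]

wheels = [(0.0, 0.0), (1.3, 0.0), (3.1, 0.0), (4.4, 0.0)]
for azimuth in (20, 60, 110):
    imaged = rotate_and_foreshorten(wheels, azimuth, range_scale=0.8)
    print(azimuth, [round(f, 4) for f in spacing_fractions(imaged)])
```

The absolute spacings change with azimuth in this example, but the fractions do not, which is the sense in which a ratio of wheel spacings is a geometric invariant.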

Figure  3  shows  preliminary  steps  in  target  recogni¬ 
tion.  In  the  upper  left  is  an  XPatch  image  of  the 
BTR60  at  45  degrees.  The  lower  left  shows  the 
Delaunay  triangulation  of  peaks  detected  from  the 
XPatch  image  in  the  target  cluster.  In  the  upper 
right  are  peaks  detected  in  the  image,  with  peaks  on 
the  longer  leading  edge  marked  with  a  surrounding 
circle;  peaks  on  the  shorter  leading  edge  are  marked 
with a cross. The leading edge perpendicular pair
is  shown  by  solid  lines.  In  the  lower  right,  candi¬ 
date  peaks  along  the  leading  edge  were  selected  for 
intensity  from  the  generic  vehicle  model  to  generate 
candidates  for  wheels.  Estimated  leading  edge  orien¬ 
tation,  length,  width,  and  distances  between  wheels 
are  shown. 

We  are  developing  an  algorithm  for  estimating  lead¬ 
ing  edges  of  targets,  insensitive  to  points  not  on 
the  leading  edge.  This  is  a  problem  for  which 
least-squares  estimation  is  ill-suited.  Least  squares 
weights  most  heavily  points  that  are  furthest  away. 
A  non  least-squares  method  was  developed,  essen¬ 
tially  that  used  in  linking  in  the  Binford-Horn  edge 
operator.  That  operator  will  be  used  in  computing 
leading  edges  of  buildings  to  discriminate  buildings 
and  other  fixed  clutter  from  targets. 
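
The sensitivity of least squares to points off the leading edge can be seen in a small example. The robust estimator used here is the Theil-Sen median-of-slopes fit, chosen only as a simple stand-in; it is not the linking method of the Binford-Horn edge operator, and the point coordinates are made up.

```python
from statistics import median

def least_squares_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def theil_sen_line(points):
    """Median of pairwise slopes, then median intercept (robust to outliers)."""
    slopes = [(y2 - y1) / (x2 - x1)
              for i, (x1, y1) in enumerate(points)
              for (x2, y2) in points[i + 1:]
              if x2 != x1]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept

# Ten points on the edge y = 0.5 x + 1, plus two interior points far from it.
edge = [(float(x), 0.5 * x + 1.0) for x in range(10)]
interior = [(8.5, 12.0), (9.5, 13.0)]
points = edge + interior
print("least squares:", least_squares_line(points))
print("Theil-Sen:    ", theil_sen_line(points))
```

Two stray points are enough to pull the least-squares slope far from the edge direction, while the median-based fit is unaffected because most point pairs lie on the true edge.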

Figure  4  shows  estimation  of  azimuth  angles  over  a 
broad  range  of  angles  in  XPatch  data  for  three  tar¬ 
gets,  BTR60,  KTANK,  and  BMP.  These  estimates 
cover  a  range  from  10  to  170  degrees;  above  we  saw 
that  the  range  could  be  extended  a  little  further, 
but  it  cannot  be  extended  below  about  5  degrees. 
[Dudgeon  94]  observed  that  zero  degrees  and  ninety 
degrees  were  special  views.  Those  views  showed  up 
in  our  experiments  and  for  physical  reasons,  in  our 
generic  vehicle  model. 

Over  this  wide  range  of  angles,  the  standard  devi¬ 
ations  for  vehicle  azimuth  angles  were  4.0,  2.3,  and 
3.7  degrees.  We  believe  that  the  measurement  will 
improve  because  we  are  still  developing  the  algo¬ 
rithm.  It  still  makes  mistakes  in  including  interior, 
non  leading  edge  points  that  bias  the  result.  Stan¬ 
dard  deviations  are  strongly  affected  by  these  mis¬ 
takes,  i.e.  wild  points. 

Figure  5  shows  persistence  for  KTANK  similar  to 
Figure  2  for  the  BTR60.  In  Figure  5,  top,  peaks 
from  views  at  1  degree  intervals  from  0  to  180  degrees 
were  rotated  into  the  0  degree  coordinate  frame  and 
superimposed.  At  the  bottom  are  peaks  from  the 
longer leading edge. Five peaks appear more
irregular than in the BTR60 images. Again, the peaks
persist  over  nearly  the  whole  angle  range. 

Similar  results  were  obtained  for  the  BMP.  They  are 
omitted  to  cut  space  and  to  avoid  detail. 


3  Observations 

Persistent  scatterers  have  been  demonstrated  in  real 
SAR  from  turntable  data  [Dudgeon  94],  in  XPatch 
synthesized  SAR,  and  predicted  in  analysis.  At 
this  early  stage,  the  apparent  ratio  of  persistent  to 
non-persistent  scatterers  is  more  favorable  than  ex¬ 
pected.  There  appear  to  be  about  a  dozen  persis¬ 
tent  scatterers  on  vehicles  like  the  three  shown  here. 


Their  persistence  appears  over  a  broader  range  of 
angles  than  initially  expected  but  now  appears  to 
follow  from  good  reasons. 

Are  there  enough  scatterers?  There  appear  to  be 
about 12 persistent scatterers typical for these
targets. If only 30% of them were obscured, the
remainder seems sufficient. In SAR, two scatterers along the
leading edge provide a measured constraint; three
scatterers are over-constrained. Relative to a
leading edge estimate, an additional point off the leading
edge  is  another  constraint  on  matching.  With  only 
three  points,  there  are  two  constraints. 

A  next  step  is  incorporating  points  interior  to  the 
leading  edge,  again  exploiting  the  generic  model. 
Some  of  that  can  still  be  done  at  a  generic  level  for 
vehicles. 

One  foot  resolution  appears  to  be  a  magic  resolution 
at  which  useful  detail  becomes  apparent  for  vehicles 
the  size  of  the  military  vehicles  used  in  these  XPatch 
studies,  about  8  meters,  27  feet.  Although  many 
pixels  cover  these  vehicles’  images,  it  is  the  size  of 
scatterers  that  matters  here.  There  are  few  pixels 
on  wheels,  turrets,  and  other  observable  structures. 
Our  estimation  methods  seem  to  be  on  the  border¬ 
line  of  resolving  the  wheels  on  the  vehicles.  Making 
effective use of available image resolution is extremely
important. 

It seems that improvements in the peak detection
provide strong leverage for the problem; i.e.,
resolving the wheels in all cases on the leading edge
would  contribute  substantially  to  classifying  vehi¬ 
cles.  The  current  peak  detection  was  conceived  for 
isolated  peaks;  it  appears  very  useful  in  terrain  and 
vegetation.  On  targets,  peaks  overlap,  affecting  peak 
detection. Rather than implement our
understanding of a solution to the overlapping peak problem,
we  have  chosen  thus  far  to  push  on  to  a  broader, 
birds-eye  view  of  the  issues.  We  are  considering  im¬ 
provements  in  image  and  in  feature  detection  in  the 
raw  SAR  phase  history  data  before  image  formation 
or  exploiting  those  advances  of  others  in  superreso¬ 
lution  [Cabrera  94;  Mann  92;  Odendaal  94]. 

In  fact,  measurement  of  dimensions  seems  to  be  rel¬ 
atively  accurate.  The  distance  between  wheels  has 
a  standard  deviation  of  10  inches.  These  dimensions 
go  a  long  way  toward  constraining  vehicle  class. 


4  Conclusions 

These  preliminary  results  appear  to  warrant  fur¬ 
ther  investigation.  Persistent  scatterers  have  been 
demonstrated  in  real  and  synthesized  SAR  experi¬ 
ments;  they  are  expected  to  exist  for  good  reasons. 
At this early stage, persistence appears over nearly
the whole angle range, again for good reasons. A sufficient
number of persistent scatterers appear that
the approach seems promising for recognition. Peak
detection seems adequate for initial investigation; further
accuracy refinement may be necessary to exploit available
imagery. A generic vehicle model has proved
surprisingly powerful.


5  References 

[Benitz 94] G.R. Benitz, “Adaptive High-Definition
Imaging”; SPIE Vol 2230, pp 106-119, 1994.

[Binford 87] T.O. Binford, et al., “Bayesian Inference
in Model-Based Machine Vision”; Proc Workshop
on Uncertainty in AI, AAAI87, 1987.

[Cabrera 94] S.D. Cabrera, et al., “Application
of One-Dimensional Adaptive Extrapolation to Improve
Resolution in Range-Doppler Imaging”; SPIE
Vol 2230, pp 135-145, 1994.

[Dudgeon 94] D.E. Dudgeon, et al., “Use of persistent
scatterers for model-based recognition”; SPIE
Vol 2230, pp 356-368, 1994.

[Haag 91] N.N. Haag, et al., “Invariant Relationships
in Side-Looking Synthetic Aperture Imagery”;
Photogrammetric Engineering and Remote Sensing,
V 57, pp 927-931, 1991.

[Ikeuchi 96] K. Ikeuchi, et al., “Invariant Histograms
and Deformable Template Matching for SAR Target
Recognition”; Proc IEEE CVPR, pp 100-105, 1996.

[Jane 84] Foss, Christopher F., “Jane’s Light Tanks
and Armoured Cars”; 1984.

[Levitt 95] T.S. Levitt, et al., “Bayesian Inference-Based
Fusion of Radar Imagery, Military Forces and
Tactical Terrain Models in the Image Exploitation
System/Balanced Technology Initiative”; Intl J. of
Human-Computer Studies, No. 42, 1995.

[Mann 92] J. Mann and R. Hummel, “Synthetic Aperture
Radar without Fourier Transforms”; Courant
Institute, 1992.

[Moulin 93] P. Moulin, “A Wavelet Regularization
Method for Diffuse Radar-Target Imaging and
Speckle-Noise Reduction”; Journal of Mathematical
Imaging and Vision, 3, pp 123-134, 1993.

[Novak 94] L.M. Novak and C.J. Owirka, “Radar Target
Identification Using an Eigen-Image Approach”;
IEEE National Radar Conference, Atlanta, GA,
1994.

[Novak 95] L.M. Novak, et al., “Effects of Polarization
and Resolution on the Performance of a
SAR Automatic Target Recognition System”; Lincoln
Laboratory Journal, Vol 8, pp 49-68, 1995.

[Novak 96] L.M. Novak, et al., “ATR Performance
Using Enhanced Resolution ATR”; SPIE Conf on
Algorithms for Synthetic Aperture Radar Imagery
III, 1996.

[Odendaal 94] J.W. Odendaal, et al., “Two-Dimensional
Superresolution Radar Using the MUSIC
Algorithm”; IEEE Transactions on Antennas
and Propagation, V 42, pp 1386-1391, 1994.

[Oliver 90] C.J. Oliver, “Clutter Classification based
on a correlated noise model”; Inverse Problems, 6,
pp 77-89, 1990.

[Owirka 94] C.J. Owirka and L.M. Novak, “A New
SAR ATR Algorithm Suite”; Conf on Algorithms
for Synthetic Aperture Radar, 1994.

[Posner 93] F.L. Posner, “Texture and Speckle in
High Resolution Synthetic Aperture Radar Clutter”;
IEEE Trans on Geoscience and Remote Sensing, V
31, pp 192-203, 1993.

[Wang 96] B. Wang and T.O. Binford, “Generic,
Model-based Estimation and Detection of Peaks in
Image Surfaces”; Proceedings of Image Understanding
Workshop, Vol. 2, pp 913-922, Feb. 1996.

[Wellman 93] R.J. Wellman, et al., “Radar Cross
Sections of Ground Clutter at 95 GHz for Summer
and Fall Conditions”; AGARD meeting on Atmospheric
Propagation Effects through Natural and
Man-Made Obscurants for Visible to MM-Wave
Radiation, 1993.




Figure 1: A BTR60, its XPatch images, and intensity profiles along its leading edge; (a) an
optical image of a BTR60 (panel title: "BTR60: Right Front"); (b) XPatch images of the BTR60
over a range of angles (panel title: "XPATCH DATA: azimuth angle = 6, 12, 18, 30, 36 and 42");
(c) intensity profiles, a slice through the XPatch image along the leading edge of the target.


Figure 2: (a) For a BTR60, peaks for view angles differing by 1 degree are rotated and superimposed;
(b) clusters around the wheels are isolated; (c) a 3D plot of clusters around a wheel in
(x,y) with azimuth along the z axis. (Panel titles: "EXAMPLE OF PERSISTENT SCATTERERS: BTR60";
"BTR60: mask=1, detected peaks, azimuth angles 45 to 135". Axis label: azimuth angle, degree.)


Figure 3: (a) The XPatch image for a BTR60 at 45 degrees; (b) the Delaunay triangulation
establishes relations among peaks in the target cluster; (c) peaks from the target with the
perpendicular leading edge pair superimposed; candidates for the leading edges are marked with
circles for the longer leading edge, with crosses for the shorter; (d) peaks that lie along the
leading edge are selected on intensity; ratios of wheel spacings match those of the BTR60.
(Panel titles: "Example: BTR60"; "Delaunay Triangulation"; "Wheels of Target, azimuth = 45 deg."
Axis label: range direction.)

Figure 4: Estimates of azimuth of three targets, BTR60, KTANK, and BMP over 10 to 170
degrees.

Abstract

The  goal  of  the  project  described  here  is  to 
perform  recognition  of  3D  objects  in  clut¬ 
tered,  unstructured  environments.  The  ob¬ 
jects  (such  as  targets)  can  be  partially  oc¬ 
cluded.  Such  unconstrained  object  recog¬ 
nition  involves  high  computational  com¬ 
plexity,  and  we  will  apply  new  methods  to 
drastically reduce this complexity. These
methods include: i) Use of invariant constraints.
Invariance has proven successful
for planar objects, but until recently
it could not be applied to images of general
3D objects. ii) Use of curve features such
as (non-coplanar) conics rather than point
features. The system developed will work
in  two  main  modes:  i)  multiple  views;  ii)  a 
single  view  plus  modeling,  such  as  a  known 
data-base  or  3D  skew-symmetry.  Reliabil¬ 
ity  will  be  of  major  concern  in  both  modes. 

1  Introduction 

We  describe  here  a  new  concept  for  a  system  that 
detects  and  identifies  objects,  either  from  a  single 
image  or  from  two  or  more  images.  The  objects  will 
be  recognized  in  an  unconstrained  and  unstructured 
environment.  The  system  will  be  tested  on  real  im¬ 
agery  available  to  the  project. 

Recognizing objects has been a major goal of IU research,
but major obstacles have been encountered.
Among  them: 

(i) Complexity. In a typical recognition system, an
observed object is compared to a database of models.
Recognition means a positive match between the
object and one of the models. Matching of visual
data is very costly computationally, unlike matching
of alphanumeric data. Given a large number
of objects, and a large database of models, we face
a difficult “high complexity” problem of comparing
each observed object to each model. The complexity
increases greatly when the number of visible objects
and models is increased. Currently this poses a severe
limit on the number of objects that can be
recognized. Using a new mathematical theory, we
will reduce the complexity by orders of magnitude,
increasing the capabilities of the system by a factor
of hundreds or thousands.

This project will be supported by the Defense Advanced
Research Projects Agency (ARPA Order No.
E655) and the U.S. Air Force Wright Laboratory. For
further information see
http://www.cfar.umd.edu/-weiss/atr.html .

(ii)  Viewpoint  dependency.  Compounding  the  prob¬ 
lem  of  complexity  is  the  fact  that  the  observed  im¬ 
age  of  an  object  depends  on  the  point  of  view  from 
which  the  object  is  observed.  We  would  like  to  store 
in  a  database  only  one  model  of  each  object.  Several 
current  systems  use  viewpoint-invariant  descriptors, 
i.e.  they  extract  from  the  observed  object  some  de¬ 
scriptors  which  are  invariant  to  the  viewpoint  and 
compare  them  to  similar  descriptors  stored  in  a 
database.  However,  these  systems  are  limited  to  ob¬ 
jects  that  are  planar  or  have  mostly  planar  contours. 
The  proposed  system  will  deal  invariantly  with  ob¬ 
jects  of  arbitrary  three-dimensional  shapes. 

(iii) Projection. When a three-dimensional object
is projected into a two-dimensional image, “depth”
information is lost. Two objects which look the same
in the image can be, in principle, very different in
the 3D reality. This can be solved by using multiple
images, but then we again encounter the complexity
problem as we try to match features of the different
views. Here again we increase the efficiency relative
to previous methods by orders of magnitude.

(iv) Reliability. This is a major problem in current
systems. Most measurements on real images are subject
to significant noise and errors. We will deal with
this in several ways. One way is cross-verification.
When observing an object such as an airplane, we
will not be satisfied with matching it with a model
airplane in the database. We will also want to make
definite identifications of parts of the object, such
as wings or engines. Only when several levels of
parts and subparts of the object have been identified
will we make a positive identification of the
whole object. This is made possible by our capability
discussed earlier, namely that we can identify
many more objects, and therefore sub-objects, than
current systems.

Another  way  to  increase  reliability  is  to  use  more 
reliable  features  extracted  from  the  object.  Most 
systems  rely  on  distinctive  points  or  lines  observed 
on  the  objects.  In  addition  to  these  we  will  also  use 
conics  such  as  circular  or  elliptical  parts  of  the  ob¬ 
ject.  These  are  usually  more  reliable  since  each  conic 
consists  of  many  points,  so  that  the  errors  associ¬ 
ated  with  measuring  each  point  tend  to  average  out. 
The  use  of  conics  will  also  reduce  the  complexity, 
because  there  are  far  fewer  of  them  than  there  are 
points.  Previous  methods  used  only  coplanar  conics. 
We  will  use  conics  in  general  3D  configurations. 

Previously  used  invariant-based  methods  have  suf¬ 
fered  from  the  fact  that  the  invariant  quantities  are 
sometimes  susceptible  to  errors.  To  overcome  this 
problem,  in  addition  to  the  above  methods,  we  will 
include  in  our  algorithm  a  final  matching  and  veri¬ 
fication  stage  which  is  independent  of  invariants. 

2  Technical  Approach 

The  conventional  approach  to  object  recognition  in¬ 
volves  finding  a  correspondence  between  features  in 
the  image  and  features  in  given  models  (e.g.,  tem¬ 
plate  matching;  see  Ballard  and  Brown,  1982).  One 
extracts  prominent  features  from  the  image  and  tries 
to  match  them  with  features  in  a  model.  Given 
the  coordinates  of  k  features  in  the  image  and  the 
coordinates  of  k  features  in  a  possibly  correspond¬ 
ing  model,  we  calculate  hypothetical  parameters  of 
the  viewpoint,  or  pose,  between  the  image  and  the 
model.  This  can  include  rotation,  translation  and 
other parameters. Repeating the calculation for
other k-tuples, the results will cluster into the correct
value of the pose parameters and into the correct
model, which thus identifies an object in the image
with this model. Having a number n of k-tuples
of features and m models, we have to check O(mn)
possible matches, i.e. a number proportional to mn.

Several  problems  arise  in  this  process.  (1)  Projec¬ 
tion.  The  object  we  observe  as  well  as  the  models 
in  the  database  are  three-dimensional,  but  the  pro¬ 
jection  in  the  image  is  two-dimensional.  Thus  the 


depth  information  is  lost  and  cannot  be  matched. 
(2)  Viewpoint  dependency.  An  image  of  an  object 
depends  on  the  point  of  view,  including  translation 
and rotation. (3) Complexity of finding the correspondence.
This is in fact a consequence of viewpoint
dependency. The fact that the visible object
differs from the model in several parameters makes
it impossible to match features without trying many
matches. The number O(mn) of possible matches
mentioned above can be very large. Although there
is no need to check all possible k-tuples of features
in the image, the number we do check is usually
many thousands. When we have hundreds of models,
the number of trial matches becomes prohibitive.
(4)  Reliability.  Images  contain  noise  and  errors  that 
corrupt  measurements  of  the  feature  coordinates. 

A  common  way  to  solve  the  projection  problem  is 
by  using  multiple  images  (stereo,  motion).  This  can 
provide  the  missing  depth  information.  The  problem 
here  is  that  now  we  must  find  the  correspondence 
between features in at least two images. The complexity
here is O(n^2), assuming we use n k-tuples in
both images. This is in addition to the problem we
had  earlier,  of  finding  the  correspondence  between 
image  features  and  model  features. 

Our  approach  will  handle  all  the  problems  men¬ 
tioned  above.  It  will  reduce  the  complexity  by  or¬ 
ders  of  magnitude,  while  solving  the  viewpoint  de¬ 
pendency  and  projection  problems. 

At  the  heart  of  our  approach  to  dealing  with 
the  problems  described  above  are  discoveries  about 
the  mathematical  relations  between  a  3D  object 
and its 2D projections. These are described in
[Weiss96a, Weiss96b].

The  key  to  solving  the  complexity  problem  is 
to  make  all  our  subsequent  treatment  viewpoint- 
invariant.  A  viewpoint  change  is  generally  repre¬ 
sented  by  a  projective  transformation.  However, 
when  an  object  is  distant  from  the  camera  (relative 
to  the  focal  length),  a  projective  transformation  is 
very  well  approximated  by  an  affine  one.  This  con¬ 
dition  is  usually  met  in  practice.  An  affine  trans¬ 
formation  can  include  translation,  rotation,  scaling, 
skewing  and  reflection.  The  latter  includes  switch¬ 
ing  from  one  half  of  a  skew-symmetric  image  of  an 
object to the other half. Instead of the k-tuples
of features, we will use quantities, calculated from
the coordinates of the features, that are invariant to
affine transformation. These invariants will be used
for matching both in the image-to-model case and
the image-to-image (stereo, motion) case. In addition
to k-tuples of pointlike features we will deal with
conics and use their invariants.
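
The affine approximation invoked above can be illustrated numerically. The following is only a sketch under assumptions that are not in the text: a pinhole camera with focal length `f`, points given in camera coordinates, and a weak-perspective (scaled orthographic) model that divides by the mean depth of the object. The point set `OBJECT` and the distances are made-up values. In this sketch the gap between the two projections shrinks roughly in proportion to the ratio of the object's depth extent to its distance.

```python
def perspective(points, f=1.0):
    """Pinhole projection: (X, Y, Z) -> (f*X/Z, f*Y/Z)."""
    return [(f * x / z, f * y / z) for x, y, z in points]


def weak_perspective(points, f=1.0):
    """Affine approximation: divide by the mean depth instead of each Z."""
    z_mean = sum(z for _, _, z in points) / len(points)
    return [(f * x / z_mean, f * y / z_mean) for x, y, _ in points]


def max_relative_error(points, f=1.0):
    """Largest gap between the two projections, relative to the image size."""
    p = perspective(points, f)
    w = weak_perspective(points, f)
    gap = max(((a - c) ** 2 + (b - d) ** 2) ** 0.5
              for (a, b), (c, d) in zip(p, w))
    size = max((a * a + b * b) ** 0.5 for a, b in p)
    return gap / size


def shifted(points, distance):
    """Move the object away from the camera along the optical axis."""
    return [(x, y, z + distance) for x, y, z in points]


# Hypothetical object, about two units across and one unit deep.
OBJECT = [(1.0, 0.0, -0.5), (0.0, 1.0, 0.5), (-1.0, 0.5, 0.0), (0.5, -1.0, 0.25)]

if __name__ == "__main__":
    for d in (5.0, 50.0, 500.0):
        print(d, max_relative_error(shifted(OBJECT, d)))
```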
Invariants have been successfully used before for
planar objects (e.g. [Weiss88; Mundy92; Weiss93a;
Weiss93b; Rivlin95]). There are no true invariants
of the projection from 3D to 2D, but one can find invariant
constraints that serve a similar purpose
([Jacobs92; Stiller94; Weiss96a; Weiss96b]).

We now describe the invariant mathematical relation
between a 3D point set and its 2D projection. Details
are in [Weiss96a]. We have a 5-tuple of points,
with 3D coordinates $\mathbf{X}_i$, $i = 1 \ldots 5$. They are
projected onto $\mathbf{x}_i$ in the image.

Since determinants are (relative) invariants of an
affine transformation, we look at the determinants
formed by these points in both 3D and 2D. Any four
of the five points in 3D, expressed in four homogeneous
coordinates, define a determinant $M_i$. We give
the determinant the same index as the fifth point
that was left out. For example,

$$M_1 = |\mathbf{X}_2, \mathbf{X}_3, \mathbf{X}_4, \mathbf{X}_5|$$

Similarly, in the 2D projection, any three of the five
points define a determinant $m_{ij}$, with indices equal
to those of the points that were left out, e.g.

$$m_{12} = |\mathbf{x}_3, \mathbf{x}_4, \mathbf{x}_5|$$

It can be shown by simple algebraic methods that
the 3D and 2D invariants, $M_i$ and $m_{ij}$, are related
by the equations

$$M_5 m_{12} + M_1 m_{25} - M_2 m_{15} = 0$$

$$M_5 m_{13} + M_1 m_{35} - M_3 m_{15} = 0$$

These relations are obviously invariant to any affine
transformation in both 3D and 2D. A 3D transformation
will merely multiply all the $M_i$ by the same
constant factor, which drops out of the equations.
A 2D affine transformation multiplies all the $m_{ij}$ by
the same constant factor, which again drops out.

The 2D invariants $m_{ij}$ are calculated from measurements
on the image. Thus the above relations
are linear equations for the unknown 3D (relative)
invariants $M_i$. There are three independent unknowns,
namely the absolute invariants:

$$\frac{M_1}{M_5},\quad \frac{M_2}{M_5},\quad \frac{M_3}{M_5} \qquad (1)$$

Since we only have two equations, we can determine
the unknowns only up to one linear parameter. This
remaining unknown represents the missing depth
information.
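
The two relations can be checked numerically. The sketch below is an illustration, not the authors' code: it assumes the reading of the equations in which the first term carries the fifth 3D determinant, takes each 2D determinant over the remaining three points in increasing index order, and models the view as an arbitrary affine map from 3D to 2D. The function names and the random test data are hypothetical.

```python
import random


def det(m):
    """Determinant by Laplace expansion along the first row (small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


def invariants_3d(pts3):
    """M[i]: determinant of the four homogeneous 3D points other than point i."""
    M = {}
    for i in range(1, 6):
        rows = [list(pts3[k - 1]) + [1.0] for k in range(1, 6) if k != i]
        M[i] = det(rows)
    return M


def invariants_2d(pts2):
    """m[(i, j)]: determinant of the three homogeneous 2D points other than i, j."""
    m = {}
    for i in range(1, 6):
        for j in range(i + 1, 6):
            rows = [list(pts2[k - 1]) + [1.0]
                    for k in range(1, 6) if k not in (i, j)]
            m[(i, j)] = det(rows)
    return m


def residuals(M, m):
    """Left-hand sides of the two relations; both should vanish."""
    r1 = M[5] * m[(1, 2)] + M[1] * m[(2, 5)] - M[2] * m[(1, 5)]
    r2 = M[5] * m[(1, 3)] + M[1] * m[(3, 5)] - M[3] * m[(1, 5)]
    return r1, r2


def random_affine_view(pts3, rng):
    """Project with a random 2x3 linear part plus a 2D offset (an affine camera)."""
    A = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    t = [rng.uniform(-1, 1) for _ in range(2)]
    view = []
    for p in pts3:
        u = sum(A[0][c] * p[c] for c in range(3)) + t[0]
        v = sum(A[1][c] * p[c] for c in range(3)) + t[1]
        view.append((u, v))
    return view


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(5)]
    M = invariants_3d(pts)
    for _ in range(3):
        print(residuals(M, invariants_2d(random_affine_view(pts, rng))))
```

The residuals come out at the level of floating-point rounding for every view, which is the sense in which the relations constrain the 3D invariants independently of viewpoint.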

3  Implementation  Plan 

Based on the above, we can build a 3D invariant
space in which recognition will take place. Each 3D
5-tuple will be represented as a point in this space,
with three invariant coordinates being the quantities
shown in eq. (1). A model will be represented by a
set of points in this space. (There will also be some
way of identifying these points as part of the same
model.) The problem is now how to match the image
to the model. We will distinguish several cases.

Figure 1: Single view recognition (legend: • model 1, o model 2)

3.1  Single  image 

This case is the simplest but least reliable. Here we
extract 5-tuples of features from the image and calculate
their 2D invariants $m_{ij}$. Substituting them
in the above equations, we obtain two equations relating
the three 3D absolute invariants of eq. (1).
Geometrically,  we  obtain  a  line  in  our  3D  invariant 
space  (Fig.  1).  If  a  5-tuple  in  the  2D  image  is  a  pro¬ 
jection  of  some  5-tuple  in  3D,  then  the  line  obtained 
from  this  2D  5-tuple  will  pass  through  the  point  in 
invariant  space  representing  the  3D  5-tuple.  To  rec¬ 
ognize  objects  we  thus  look  for  instances  in  which 
lines  obtained  from  the  image  pass  through  points 
representing  models  in  the  invariant  3D  space. 
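
The line in invariant space can be written down explicitly. Dividing the two relations by the fifth 3D determinant and treating the first absolute invariant as the free parameter gives a line through the point (0, m12/m15, m13/m15) with direction (m15, m25, m35). The sketch below assumes m15 is nonzero and uses a hypothetical tolerance `tol`; how close a model point must lie to a line to count as a match is not specified in the text.

```python
import math


def invariant_line(m12, m13, m15, m25, m35):
    """Return (point, direction) of the solution line in invariant space.

    Assumes m15 != 0; a 5-tuple with m15 == 0 needs a different
    parametrization, which this sketch does not handle.
    """
    if m15 == 0:
        raise ValueError("degenerate 5-tuple: m15 is zero")
    point = (0.0, m12 / m15, m13 / m15)
    direction = (m15, m25, m35)
    return point, direction


def distance_to_line(q, point, direction):
    """Euclidean distance from q to the line point + t * direction."""
    w = [q[k] - point[k] for k in range(3)]
    cx = w[1] * direction[2] - w[2] * direction[1]
    cy = w[2] * direction[0] - w[0] * direction[2]
    cz = w[0] * direction[1] - w[1] * direction[0]
    cross_norm = math.sqrt(cx * cx + cy * cy + cz * cz)
    return cross_norm / math.sqrt(sum(d * d for d in direction))


def matches(model_point, line, tol=1e-6):
    """True if the model's invariant point lies on the image line (within tol)."""
    point, direction = line
    return distance_to_line(model_point, point, direction) <= tol


if __name__ == "__main__":
    # Worked example: the points (0,0,0), (1,0,0), (0,1,0), (0,0,1), (1,1,1)
    # viewed along the z axis give m12=1, m13=-1, m15=1, m25=0, m35=0,
    # and the model's absolute invariants are (2, 1, -1).
    line = invariant_line(m12=1.0, m13=-1.0, m15=1.0, m25=0.0, m35=0.0)
    print(matches((2.0, 1.0, -1.0), line))
    print(matches((2.0, 1.5, -1.0), line))
```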

While  this  might  work  for  a  small  number  of  mod¬ 
els,  we  suspect  that  with  many  models  and  objects, 
producing  many  inexact  lines  and  points,  there  will 
be  many  lines  that  pass  through  or  near  points  that 
do  not  belong  to  them  in  reality.  We  need  to  find 
a  way  to  eliminate  the  extra  parameter  along  each 
line  and  reduce  it  to  a  point. 
Figure 2: Two view matching


3.2  Multiple  images 

The  first  stage  here  is  similar  to  the  above,  i.e.  ex¬ 
tract  a  5-tuple  of  features  from  one  of  the  images  and 
transform  it  into  a  line  in  the  3D  invariant  space. 

Next,  we  look  at  a  different  image,  in  which  objects 
are  seen  from  a  different  viewpoint.  Again  we  draw 
a  line  in  the  same  3D  space  of  invariants  as  before. 
If  this  line  meets  the  first  line  in  3D  then  the  two  5- 
tuples  have  the  same  3D  invariants  (Fig.  2).  These 
are  the  coordinates  of  the  intersection  point.  This 
means  that  the  two  5-tuples  are  affine  equivalent. 
This  in  turn  indicates  that  we  may  have  two  different 
views  of  the  same  5-tuple. 

With n 5-tuples, the total number of lines in the
invariant space is O(n). We can find line intersections
in a way similar to that used in Hough space,
namely divide the space into bins and see if a certain
bin has more than one line going through it. We do
not need to check all bins; we only need to go along
the known lines. A hierarchical scale-space approach
can be used to make the process more efficient. The
exact technique and its cost need to be investigated,
but it should be about O(n). Therefore our total
complexity is O(n) rather than O(n^2) as in previous
methods.
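
The binning idea can be sketched as follows. This is a simplified illustration, not the technique the text leaves open: it assumes a fixed bin size rather than a hierarchical scale-space scheme, restricts attention to a bounded cube of invariant space, and walks each line by uniform sampling, which can miss a bin that the line only clips at a corner. The parameters `bin_size` and `extent` are hypothetical.

```python
from collections import defaultdict
import math


def bins_along_line(point, direction, bin_size, extent):
    """Bins visited by a line inside the cube [-extent, extent]^3."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    step = bin_size / 4.0  # sample finer than a bin to limit skipped bins
    reach = extent * math.sqrt(3.0) + math.sqrt(sum(p * p for p in point))
    n = int(reach / step)
    visited = set()
    for i in range(-n, n + 1):
        q = [point[k] + i * step * unit[k] for k in range(3)]
        if all(abs(c) <= extent for c in q):
            visited.add(tuple(math.floor(c / bin_size) for c in q))
    return visited


def candidate_intersections(lines, bin_size=0.5, extent=4.0):
    """Map each bin crossed by two or more lines to the indices of those lines."""
    votes = defaultdict(set)
    for idx, (point, direction) in enumerate(lines):
        for b in bins_along_line(point, direction, bin_size, extent):
            votes[b].add(idx)
    return {b: ids for b, ids in votes.items() if len(ids) > 1}


if __name__ == "__main__":
    lines = [
        ((0.0, 1.2, -0.8), (1.0, 0.0, 0.0)),   # passes through (2.2, 1.2, -0.8)
        ((2.2, 0.0, -2.0), (0.0, 1.0, 1.0)),   # passes through the same point
        ((0.0, -3.0, 3.0), (1.0, 0.0, 0.0)),   # unrelated line
    ]
    for b, ids in sorted(candidate_intersections(lines).items()):
        print(b, sorted(ids))
```

Only bins along the known lines are ever touched, so the work grows with the number of lines times the number of bins each line crosses, in keeping with the roughly linear cost anticipated above.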

Based on the above discussion, our recognition algorithm
for multiple views involves the following steps:

i)  Feature  extraction.  Find  candidate  5-tuples  that 
are  potentially  affine  equivalent.  These  can  be  sets 
of  points  in  some  window  placed  in  a  similar  location 
in  both  images,  and  they  can  overlap.  Various  clues 
can  be  used  to  prune  unpromising  sets,  e.g.  non¬ 
matching  links  that  are  visible  between  the  points, 
or  parallelism. 


ii) Invariant description. For each 5-tuple that remains
from (i), and for each view, calculate the two
equations for the three 3D invariants, i.e. plot a
line in a 3D invariant space. Find all lines that meet
in 3D. Each intersection represents the 3D affine invariants
of two affine equivalent 5-tuples.

iii)  Recognition.  We  now  have  invariant  3D  point 
sets  representing  various  visible  objects.  Similarly, 
the  models  in  the  database  can  be  represented  by 
point  sets  in  the  same  invariant  3D  space.  We  will 
use  several  5-tuples  in  each  model,  obtaining  several 
points  representing  the  model  in  the  invariant  space. 
Identification is now straightforward. If an object’s
point set falls on a model’s point set in the invariant
space, then the object is identified with the model.
No  search  is  needed  to  find  the  right  model,  because 
the  invariant  point  sets  of  the  models  can  be  indexed 
according  to  their  coordinates.  No  search  is  needed 
for  the  viewpoint  or  pose  either  because  of  the  in¬ 
variance.  As  many  points  as  practical  will  be  used 
to  increase  reliability. 

iv)  Verification.  This  step  is  independent  of  the  in¬ 
variants  method.  It  overcomes  any  errors  that  we 
may  have  in  calculating  the  invariants.  Using  the 
2D  coordinates  of  the  images,  and  the  correspon¬ 
dence  found  in  (iii),  we  calculate  the  3D  coordinates 
of  the  features.  Now  we  can  find  the  3D  transforma¬ 
tion  (pose)  that  produces  the  best  fit  between  the  3D 
object  and  the  model  identified  in  (iii),  using  least 
squares  fitting.  The  identification  is  rejected  if  the 
fitting  error  is  too  big.  We  may  try  to  fit  several 
models  to  each  object  to  find  the  best  fit. 

3.3  Single  image,  symmetric  models 

The  problem  encountered  earlier  in  the  single-image 
case  was  the  unknown  parameter  along  the  line,  re¬ 
sulting  from  the  missing  depth  information.  The 
only  way  to  compensate  for  this  in  a  single  image  is 
to  use  a  model-based  approach,  i.e.  make  a  modeling 
assumption  about  the  3D  object. 

One  modeling  assumption  that  we  will  use  is  sym¬ 
metry.  Most  man-made  objects  are  symmetric,  e.g. 
vehicles,  tanks,  airplanes  and  buildings.  Symmetry 
does  not  need  to  be  bilateral;  the  Pentagon  building 
is  an  example  of  a  five-fold  symmetry.  Symmetry  is 
also  found  in  human  and  animal  bodies.  The  (Eu¬ 
clidean)  symmetry  is  observed  explicitly  only  in  a 
frontal  view.  In  any  other  view  the  symmetry  is 
observed  as  a  skew-symmetry  (i.e.  it  is  an  affine 
or  projective  symmetry).  Many  researchers  have 
used  skew-symmetry  for  recognition,  but  with  seri¬ 
ous  limitations.  They  usually  assume  that  the  skew- 
symmetry  itself  is  known,  i.e.  we  know  which  feature 
in one half of the object corresponds to which feature
in  the  symmetric  half.  In  other  words,  they  assume 
that  the  correspondence  problem  has  already  been 
solved.  Here  we  will  make  no  such  assumption  but 
will  detect  the  skew-symmetric  objects  in  an  image. 

The  two  halves  (or  three  thirds,  etc.)  of  a  skew- 
symmetric  object  are  affine  equivalent.  Therefore 
we  can  apply  the  algorithm  described  above  which 
was  designed  to  find  affine  equivalent  5-tuples.  Hav¬ 
ing  found  matching  5-tuples,  we  have  to  verify  that 
they  are  halves  of  the  same  object.  The  lines  con¬ 
necting  corresponding  points  in  a  symmetric  object 
are  parallel  in  3D,  therefore  they  will  be  parallel  in 
an  affine  projection,  and  this  is  easy  to  check. 
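
The parallelism check is simple to state in code. The sketch below assumes the candidate correspondences are given as pairs of 2D image points and uses a hypothetical tolerance `tol` on the sine of the angle between links; the affine camera in `affine_view` is made up for the example.

```python
import math


def links_are_parallel(pairs, tol=1e-6):
    """True if the segments joining corresponding 2D points are all parallel.

    pairs: list of ((x, y), (x', y')) hypothesized symmetric partners.
    """
    vectors = [(q[0] - p[0], q[1] - p[1]) for p, q in pairs]
    lengths = [math.hypot(vx, vy) for vx, vy in vectors]
    ref_len = max(lengths)
    if ref_len == 0.0:
        return False  # no usable link at all
    rx, ry = vectors[lengths.index(ref_len)]  # longest link is the reference
    for (vx, vy), length in zip(vectors, lengths):
        if length == 0.0:
            continue  # a point on the symmetry plane maps to itself
        sine = abs(rx * vy - ry * vx) / (ref_len * length)
        if sine > tol:
            return False
    return True


def affine_view(p):
    """A made-up affine camera used only for this example."""
    x, y, z = p
    return (0.8 * x + 0.3 * y + 0.1 * z + 1.0,
            -0.2 * x + 0.9 * y + 0.4 * z - 2.0)


if __name__ == "__main__":
    # One half of an object that is mirror-symmetric about the plane x = 0.
    half = [(1.0, 0.0, 0.0), (2.0, 1.0, 3.0), (0.5, -1.0, 2.0)]
    pairs = [(affine_view(p), affine_view((-p[0], p[1], p[2]))) for p in half]
    print(links_are_parallel(pairs))
```

With noisy measurements the tolerance would have to be set from the expected localization error, which the text does not quantify.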

The  verification  step  (iv)  is  easier  because  of  the 
symmetry  assumption.  The  skew-symmetric  object 
that  we  have  found  can  be  rectified,  using  an  affine 
transformation,  to  obtain  a  standard  view  in  which 
the  object  is  (non-skew)  symmetric.  It  can  then  be 
matched  directly  with  a  database  of  symmetric  mod¬ 
els. 

4  Non-pointlike  Features 

The  above  discussion  deals  with  point-like  features. 
Other  types  of  features  can  be  useful,  e.g.  lines,  con¬ 
ics  and  higher  order  curves.  Conics  arise  in  many 
cases  from  circular  parts  of  objects,  projected  as  el¬ 
lipses.  There  are  several  advantages  to  using  them. 
(1)  There  are  far  fewer  of  them  than  there  are  points. 
Therefore  the  complexity  problem  is  greatly  allevi¬ 
ated.  (2)  They  are  more  reliable  than  points,  be¬ 
cause  each  conic  is  made  up  of  many  points  whose 
measurement  errors  tend  to  average  out. 

Earlier  work  used  invariants  of  a  pair  of  conics  for 
recognition.  A  serious  limitation  was  that  the  two 
conics  had  to  be  coplanar.  In  [Weiss96b]  we  remove 
this  restriction  and  deal  with  arbitrary  conics  in  3D. 
Given  two  uncalibrated  images  of  a  pair  of  conics 
in  an  arbitrary  3D  configuration,  we  find  the  corre¬ 
spondence  properties  (the  epipolar  geometry)  of  the 
two  images,  and  the  parameters  of  the  two  conics, 
up  to  a  3D  affine  transformation. 

Based  on  this  we  can  devise  a  recognition  algorithm 
as  follows.  Given  two  images,  we  pick  a  pair  of  con¬ 
ics  in  each.  Using  our  method  we  find  the  epipolar 
geometry  parameters  of  the  images.  We  repeat  the 
process  with  other  conic  pairs.  With  several  such 
pairs,  the  parameters  of  the  epipolar  geometry  will 
cluster  around  the  correct  values.  The  complexity 
here  is  low  because  of  the  small  number  of  conics. 
From  this  we  can  find  the  3D  parameters  of  all  con¬ 
ics  (up  to  a  global  affine  transformation,  i.e.  we  can 
find  affine  invariant  versions  of  the  visible  objects). 


These  will  be  compared  to  the  appropriate  invariant 
parameters  of  conics  from  the  models.  These  latter 
invariants  will  be  indexed  in  a  database  of  models. 

This  algorithm  can  be  combined  with  the  point-like 
feature  based  algorithm.  Finding  the  epipolar  ge¬ 
ometry  makes  it  easier  to  find  the  correspondence 
between  point-like  features  in  a  stereo  pair,  further 
reducing  the  complexity  involved  in  the  algorithm 
described  earlier. 

5  Reliability 

Reliability  lies  in  numbers.  Any  system  that  has  to 
deal  with  noisy  inaccurate  elements  needs  large  num¬ 
bers  of  them  to  obtain  a  reliable  output.  There  is 
no  shortage  of  information  elements  even  in  a  single 
image;  the  problem  is  how  to  put  them  together.  We 
can  increase  the  numbers,  and  thus  the  reliability  of 
our approach, in several ways:

1)  Point  sets  will  be  extracted  not  as  isolated  fea¬ 
tures  but  as  intersections  of  lines.  A  line  contains 
many  points  and  is  more  reliable  than  a  single  point. 

2)  Many  point  sets  will  be  used  for  a  single  object.  A 
match  of  one  or  two  point  sets  may  not  be  very  reli¬ 
able,  but  several  matches  involving  the  same  object 
can  yield  a  positive  identification. 

3)  A  verification  stage,  independent  of  the  invari¬ 
ance  method,  is  added  as  described  earlier  in  step 
(iv).  Several  candidate  models  will  be  fitted  to  each 
visible  object  and  the  best  fitting  model  will  be  cho¬ 
sen. 

4)  Curves  such  as  conics  will  be  used  in  addition  to 
points,  as  described  earlier. 

5)  Parts  of  objects  will  be  recognized  independently, 
e.g.  the  wings  of  an  aircraft,  to  provide  verification. 

All  these  methods  will  be  tested  as  described  in  the 
next  section. 

6  Evaluation  Plan 

Our  main  concern  in  evaluating  our  system’s  perfor¬ 
mance  is  reliability.  We  have  proposed  several  ways 
to  increase  reliability,  including  the  use  of  many  fea¬ 
tures  per  object,  and  the  addition  of  a  verification 
stage  which  is  independent  of  invariants.  The  ef¬ 
fect  of  these  and  other  factors  on  reliability  will  be 
tested. 

The  reliability  of  the  system  in  terms  of  detection 
rate  and  false  alarms  depends  on  many  variables. 
Some  of  these  variables  are  inherent  in  the  data, 
while others are specific to the system. The variables
of the images are not under our control since we
will use government-supplied real imagery. These
variables  include  noise  and  errors  in  the  images,  the 
apparent  object  sizes,  the  angles  of  view,  and  the 
disparities  between  stereo  images.  The  system  vari¬ 
ables  are  under  our  control.  These  include  the  num¬ 
ber  of  models  in  the  database,  the  size  of  the  win¬ 
dows  in  which  we  are  looking  for  the  objects,  the 
number  of  features  used  per  model,  and  to  some  ex¬ 
tent  the  choice  of  features. 

The  dependence  of  the  detection/false  alarm  rate  on 
these  variables  is  understood  qualitatively.  The  rate 
is  higher  if  we  have  lower  noise,  bigger  apparent  size 
of  the  objects,  fewer  models  in  the  database,  etc. 
The  evaluation  plan  will  aim  at  quantifying  these 
dependencies.  It  will  consist  of  the  following: 

1)  Conduct  experiments  to  determine  the  quantita¬ 
tive  dependence  of  the  detection/false  alarm  rate  on 
variables  that  are  under  our  control.  To  this  end  we 
will  keep  the  detection  rate  constant  (e.g.  100%  or 
close  to  it).  We  will  look  at  the  same  data,  and  we 
will  vary  the  system  variables  until  we  achieve  the 
desired  detection  rate  for  those  data.  In  this  way  we 
will  optimize  the  system  for  the  particular  dataset 
that  we  have  chosen. 

2)  Using  the  optimized  system  parameters,  we  will 
conduct  experiments  on  other  sets  of  data,  with  dif¬ 
ferent  characteristic  variables,  and  determine  the  de¬ 
tection/false  alarm  rates.  We  will  determine  quan¬ 
titatively  how  various  variables  which  are  not  under 
our  control  affect  the  performance. 

The  output  from  these  experiments  will  be  interpo¬ 
lated  as  a  surface  in  a  multidimensional  space.  The 
surface  “height”  will  represent  the  detection  rate, 
while  each  abscissa  will  represent  one  of  the  vari¬ 
ables  involved. 
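As a minimal sketch of what such an interpolated surface could look like for just two variables, the following bilinearly interpolates a detection rate measured on a small grid. The function name `interpolate_rate`, the choice of variables (noise level and apparent size), and all numeric values are illustrative assumptions, not measurements from this project.

```python
from bisect import bisect_right

def interpolate_rate(xs, ys, rates, x, y):
    """Bilinearly interpolate a detection rate measured on a regular grid.

    xs, ys : sorted sample values of two variables (assumed here to be
             noise level and apparent object size)
    rates  : rates[i][j] is the measured detection rate at (xs[i], ys[j])
    """
    def bracket(grid, v):
        # Index i with grid[i] <= v <= grid[i + 1], clamped to the grid.
        i = bisect_right(grid, v) - 1
        return max(0, min(i, len(grid) - 2))

    i, j = bracket(xs, x), bracket(ys, y)
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    low = rates[i][j] * (1 - tx) + rates[i + 1][j] * tx
    high = rates[i][j + 1] * (1 - tx) + rates[i + 1][j + 1] * tx
    return low * (1 - ty) + high * ty

# Made-up measurements: rows are noise levels, columns are apparent sizes.
noise = [0.0, 10.0, 20.0]
size = [16.0, 32.0]
measured = [[0.90, 1.00], [0.70, 0.90], [0.40, 0.70]]

estimate = interpolate_rate(noise, size, measured, 5.0, 24.0)
```

With more than two variables the same idea extends to multilinear interpolation, at the price of needing measurements on a grid whose size grows with every added variable.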

Since both the data and the software will be available
to the IU community, similar evaluations can
be performed by other groups using our system. To
achieve  battlefield  utility,  very  high  standards  of  per¬ 
formance  will  be  needed,  such  as  detecting  hundreds 
of  objects  of  rather  small  apparent  sizes  in  a  battle¬ 
field  filled  with  smoke  and  dust. 
Abstract

The New York University (NYU) project on
“Wavelet-based Target Hashing for Automatic
Target Recognition (ATR)” has focused on the
use of geometric hashing, and related object
recognition strategies, for the detection and
localization of targets in digital imagery, particularly
in Forward Looking Infrared Radar (FLIR)
and in Synthetic Aperture Radar (SAR) imagery.
There are three main aspects to this
work:

1.  The  development  of  stable,  robust,  fea¬ 
ture  extraction  methods  that  apply  to  spe¬ 
cific  image  modalities,  and  enable  discrim¬ 
ination  based  on  feature  vectors  describ¬ 
ing  idealized  “interesting”  constructs  in  the 
sensor  data; 

2.  Theories  for  matching  extracted  features 
to  model  features,  whether  the  models  are 
predicted  or  observed  from  training  data, 
generally  assuming  that  there  exists  a  one- 
to-one  association  between  a  subset  of  the 
observed  features  with  a  subset  of  the 
model  features; 


*Part of this work has been supported by the Air
Force (AFOSR) through the Grants F49620-96-1-0159
and GC123913 NGD. The views and conclusions
contained in this document are those of the authors and
should not be interpreted as representing the official
policies, either expressed or implied, of the Advanced
Research Projects Agency, the United States Government,
or NYU.


3.  Studies  of  the  discriminability  of  targets  in 
actual  applications  by  making  use  of  geo¬ 
metric  hashing  software  for  the  recognition 
of  models  in  observed  imagery. 

1  Introduction 

In  addition  to  the  Automatic  Target  Detection 
and  Recognition  (ATD/R)  University  Research 
Initiative  funded  by  DARPA  for  the  develop¬ 
ment  of  these  technologies,  there  are  a  number 
of  related  research  projects  at  NYU  that  also 
feed  into  these  themes.  In  particular,  NYU  is  a 
prime  contractor  in  the  Moving  and  Stationary 
Target  Acquisition  and  Recognition  (MSTAR) 
program,  and  has  been  active  in  the  develop¬ 
ment  of  a  Match  Module  for  the  model-based 
SAR  ATR  system,  and  more  recently  has  been 
addressing  the  problem  of  global,  intelligent 
search  by  the  model-based  subsystem  of  the 
MSTAR  algorithm  suite.  We  briefly  summa¬ 
rize  the  relevance  of  the  MSTAR  work  below, 
noting  how  the  ATD/R  work  has  benefitted 
the  MSTAR  algorithm  development,  and  noting 
how  the  MSTAR  work  has  suggested  problems 
and  studies  conducted  in  the  ATD/R  project. 

In  addition,  we  have  been  working  with  a 
Florida  company  called  I-Math  Associates,  led 
by  Alexander  Akerman  III,  to  deliver  software 
for  geometric  hashing  for  ATR  applications.  We 
have  structured  this  software  in  such  a  way 
that  it  can  easily  be  modified  for  application  to 
MSTAR,  and  has  also  been  used  in  the  ATD/R 
project. 
Finally, we have been conducting a project, also
supported by the Air Force (AFOSR), focusing
on  robust  recognition  including  articulated  ob¬ 
ject  contours.  While  this  work  has  not  been 
applied  to  articulated  target  recognition  using 
SAR  or  FLIR  targets  (mainly  due  to  the  lack  of 
data  of  articulated  vehicles),  we  expect  this  to 
change  in  the  near  future. 

The full body of work can be reviewed by accessing
http://simulation.modsim.nyu.edu/
(look under “Projects”). This web page can
also be located from http://cs.nyu.edu/ under
“Research Projects” and then under “Computer
Vision Group.”

In the following, we highlight four topics. We
begin with the discriminability of targets mentioned
in (3), as listed above. We then study
matching, and discuss a use of the bipartite
matching algorithm for generalized feature-to-feature
matching. This work forms part of topic
(2). We move to topic (1), considering feature
extraction, and discuss in some detail the corner
and multi-junction detector that has been
developed at NYU. The key point of this detector
is that it does not depend on a prior separate
edge extraction process; it has also been
extensively tested. We finally discuss the scheme
for recognizing articulated object contours.

2  Discriminability  of  Targets  from 
Features 

Overall,  the  NYU  projects  in  ATD/R  empha¬ 
size  object  classification  and  identification,  and 
presuppose  a  detection  process.  Accordingly, 
the  algorithms  form  part  of  a  body  of  work  in 
“Recognition  Theory”  that  attempts  to  match 
patterns of features to model patterns of features.
The emphasis is on a sufficient degree
of  recognition  to  be  able  to  distinguish,  for  ex¬ 
ample,  a  T72  from  an  M60  in  FLIR  imagery. 
Of  course,  we  depend  on  a  sufficient  resolu¬ 
tion  so  as  to  extract  enough  pixels  on  each 
target. Ultimately, a sufficient number of features
are required so as to disambiguate the
observed object from among the collection of
models, and from the model for clutter and
non-specific targets. Some separate theoretical
analyses suggest that this is possible if there are
something  like  30  or  35  individual,  independent 
“features”  on  each  target,  where  a  feature  (in 
this  case)  means  a  single  coordinate  (of  a  possi¬ 
bly  higher-dimensional  feature),  represented  by 
a  real  number.  This  ballpark  estimate  makes 
many  assumptions,  and  will  vary  according  to 
the  background  complexity  and  the  specificity 
of  the  features  on  the  target.  However,  for  the 
kind  of  target  versus  background  discrimination 
that  is  required  using  high-resolution  imagery  so 
as  to  keep  false  alarm  rates  to  tolerable  levels 
(something  less  than  one  false  alarm  per  square 
kilometer  of  SAR  imagery,  for  example),  these 
estimates  provide  some  guidance.  Of  course, 
since 35 independent features might come from
a  dozen  locations  in  the  image,  each  with  an  at¬ 
tribute  measure,  we  conclude  that  recognition 
should  be  possible  given  perhaps  a  thousand 
pixels  on  the  target.  Fewer  pixels  on  target 
might  suffice  when  there  are  a  dozen  significant 
location  features  (with  attributes)  that  can  be 
extracted,  but  sufficient  resolution  is  the  key  to 
being  able  to  extract  a  sufficient  number  of  fea¬ 
tures. 

3  Bipartite  Matching 

The  bipartite  matching  problem  can  be  de¬ 
scribed  as  the  problem  of  determining  a 
minimum-cost  one-to-one  assignment  of  nodes 
from  one  collection  to  another  collection,  where 
the  total  cost  is  the  sum  of  nonnegative  indi¬ 
vidual  constituent  costs.  In  the  field  of  com¬ 
puter  vision,  the  object  recognition  problem  can 
often  be  formulated  as  a  problem  of  match¬ 
ing  a  collection  of  model  features  to  extracted 
features, where the model features are suitably
transformed into the image domain and
where  noise  and  obscuration  can  lead  to  no¬ 
match  conditions.  Under  the  assumptions  that 
the  features  are  conditionally  independent,  it  is 
possible  to  formulate  a  Bayesian  match  met¬ 
ric  as  a  sum  of  individual  terms  based  on  the 
correspondences  between  model  and  image  fea¬ 
tures.  This  model  provides  an  explicit  represen¬ 
tation  for  noise  and  obscuration,  manifested  as 
unmatched  extracted  features  and  unmatched 
model features. The optimal assignment problem
can be seen as a matching problem in bipartite
graphs, and thus admits an efficient solution.

Our Model

In model-based computer vision, predefined
models are hypothesized to be present in the
scene and evidence is sought in the sensed image
for supporting or rejecting each hypothesis.
Search strategies are used to guide the hypothesis
evidence accrual process, which continues
until either a decision is made about the identity
of the particular model(s) present and the geometric
transformation that best superimposes
the model features onto the image, or an empty
decision is advocated, implying that the statistical
evidence is not strong enough to support
any of the current model hypotheses.

The object recognition problem then consists of
finding an instance of (a subset of) the model
embedded in the image subject to invariance under
some group of transformations. We accomplish
this recognition by matching model features
to extracted features, and comparing the
corresponding feature values. We are concerned
with the efficient evaluation of a match metric
in order to compare a hypothesized model with
an observed scene. There are two aspects to
this comparison: (1) finding correspondences
between extracted and model features, after the
model has been suitably transformed to optimize
the ultimate score, and (2) computing the
score based on those associations by means of
a Bayesian metric for the distance between the
sets of features.

The Bayesian match score can be expressed as

$$
\log \frac{L(X, \Gamma \mid T_k, \theta_0)}{L(X \mid T_0)}
= \sum_{(i,j) \in \Gamma} \bigl( \log p_i + \log f_i(x_j - y_i) - \log p_0(x_j) \bigr)
+ \sum_{y_i \in Y^u} \log(1 - p_i)
+ \sum_{x_j \in X^u} \bigl( \log p_j(x_j) - \log p_0(x_j) \bigr), \tag{1}
$$

where $\Gamma$ is the set of matched pairs, $Y^u$ the set of
unmatched model features, and $X^u$ the set of unmatched
extracted features. This score can be written as a sum of
weights for a given assignment of model features to extracted
features as follows. Let

$$
\begin{aligned}
c_{ij} &= \log p_i + \log f_i(x_j - y_i) - \log p_0(x_j), \\
a_i &= \log(1 - p_i), \\
b_j &= \log p_j(x_j) - \log p_0(x_j).
\end{aligned} \tag{2}
$$

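Then consider $z_{ij}$ to be a permutation matrix
that assigns feature $y_i$ to feature $x_j$ by setting
$z_{ij} = 1$. Likewise, we set the indicator variable
$y_i = 1$ when model feature $y_i$ is unmatched, and
the indicator variable $x_j = 1$ when extracted feature
$x_j$ is unmatched; all other variables $z_{ij}$, $y_i$ and
$x_j$ are set to zero. Then, using the definitions
above, and in view of equation (1), the problem
of choosing correspondences so as to maximize
the resulting score becomes

$$
\text{maximize} \quad \sum_i \sum_j c_{ij} z_{ij} + \sum_i a_i y_i + \sum_j b_j x_j \tag{3}
$$

subject to the following constraints on the variables:

$$
\begin{aligned}
\sum_j z_{ij} + y_i &= 1 && \text{for } i = 1, 2, \ldots, m, && (4) \\
\sum_i z_{ij} + x_j &= 1 && \text{for } j = 1, 2, \ldots, s, && (5) \\
z_{ij},\ x_j,\ y_i &\ge 0 && \text{for all } i, j. && (6)
\end{aligned}
$$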

The  constraints  represent  the  fact  that  for  each 
predicted  feature,  either  there  is  exactly  one 
match  to  an  extracted  feature,  or  the  feature  is 
unmatched,  but  not  both.  Likewise,  for  each  ex¬ 
tracted  feature,  either  there  is  a  match  to  a  cor¬ 
responding  predicted  feature,  or  the  extracted 
feature  is  unmatched,  but  not  both. 

Thus  the  original  problem  can  be  seen  as  bipar¬ 
tite  graph  matching,  where  the  m  predicted  fea¬ 
tures  are  supplemented  with  s  extra  nodes,  one 
for  each  extracted  feature  to  serve  as  potential 
no-match  nodes,  and  the  s  extracted  features 
are  supplemented  with  m  no-match  nodes,  one 
for  each  predicted  feature  (Figure). 

As a result of these observations, we see that the
problem of finding an association between predicted
model features and extracted scene features
in such a way as to maximize a score that
accounts for penalties for unmatched predicts
and unmatched extracts can be converted into
a bipartite matching problem that can be easily
solved using the Hungarian algorithm. Thus
a combinatorial number of possible assignments
can be examined in polynomial time, and an
optimal assignment is obtained in terms of the
match metric based on a Bayesian scoring. The
essential ingredient permitting this optimal association
is the structure of the match metric as
a sum of nonnegative weights, and this formulation
in turn depends on using log-likelihoods
and conditional independence assumptions.

Figure 1: Peak correspondences of a military
truck.
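The conversion to a square assignment problem can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's software: the function names `hungarian` and `match_features`, the large constant used to forbid impossible pairings, and the example weights are all hypothetical. Rows are the m model features plus s no-match nodes, columns are the s extracted features plus m no-match nodes, and a textbook O(n^3) Hungarian solver minimizes the negated weights.

```python
def hungarian(cost):
    """Minimum-cost perfect matching on a square cost matrix.

    Returns a list 'assignment' where assignment[row] is the chosen column.
    Standard potential-based Hungarian algorithm, O(n^3).
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (n + 1)          # column potentials
    p = [0] * (n + 1)            # p[j] = row currently matched to column j
    way = [0] * (n + 1)
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment


def match_features(c, a, b):
    """Maximize sum c_ij z_ij + sum a_i y_i + sum b_j x_j with no-match nodes."""
    m, s = len(a), len(b)
    n = m + s
    big = 1e9                                  # assumed "forbidden" cost
    cost = [[big] * n for _ in range(n)]
    for i in range(m):
        for j in range(s):
            cost[i][j] = -c[i][j]              # model i matched to extract j
        cost[i][s + i] = -a[i]                 # model i left unmatched
    for j in range(s):
        cost[m + j][j] = -b[j]                 # extract j left unmatched
        for i in range(m):
            cost[m + j][s + i] = 0.0           # unused no-match pair
    assignment = hungarian(cost)
    pairs = [(i, assignment[i]) for i in range(m) if assignment[i] < s]
    score = -sum(cost[r][assignment[r]] for r in range(n))
    return pairs, score


# Made-up weights for two model features and two extracted features.
pairs, score = match_features(
    c=[[2.0, -5.0], [-4.0, -3.0]], a=[-1.0, -1.0], b=[-0.5, -0.5])
```

In this example the best solution matches model feature 0 to extracted feature 0 and leaves the other two features unmatched, because the two no-match penalties together cost less than the poor match between them.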

4  Junction  Detection 

Corners and T-, Y-, and X-junctions give depth cues,
which are a critical aspect of image understanding
tasks such as object recognition: junctions
form an important class of features invaluable
in most vision systems.

The three main issues in a junction (or
most feature) detector are: (1) scale; (2)
location; (3) junction (feature) parameters.
The junction parameters are: the radius or
size; the number of wedges (lines, corners,
3-junctions such as T or Y, 4-junctions
such as X, etcetera); the angle
of each wedge; and the intensity in each of the
wedges. Most junction detectors reported
in the literature do not address all these issues
in toto: very often ad hoc methods are used
to address the scale or location issue. Our
main contribution is a modeling of the junction
(using the minimum description length
principle) which is complex enough to handle
all three issues and simple enough to admit
an effective dynamic programming solution.

A similar approach can be used to model
other features like thick edges, blobs and
end-points.

Figure 2: Peak correspondences between a
predicted truck model and a SAR
image of a tank.

We  have  tested  this  detector  against  a  large 
variety  of  imagery  and  tested  the  parameters 
against  different  levels  of  noise  to  conclude  that 
we  have  a  very  robust  junction  detector.  We 
have  not  yet  tested  the  algorithm  within  a 
recognition  system,  which  we  plan  to  do  this 
coming  year. 
Figure 3: Piecewise constant features: a bar
detector on the left column and a
junction detector on the right column.


The  Junction  Model 

We  model  a  junction  as  a  region  of  an  im¬ 
age  where  the  values  are  piecewise  constant  in 
wedge-shaped  regions  emanating  radially  from 
a  central  point,  covering  a  small  disk  centered 
at  the  point  and  omitting  a  (much)  smaller  disc 
centered  at  this  point  (see  figure  3).  The  param¬ 
eters  of  a  junction  consist  of  (i)  the  radius  of  the 
junction-disk,  (ii)  the  center  location,  (iii)  the 
number  of  radial  line  boundaries,  (iv)  the  angu¬ 
lar  direction  of  each  such  boundary,  and  (v)  the 
intensity  within  each  wedge.  The  radius  of  the 
disk  addresses  the  “scale”  issue,  and  the  loca¬ 
tion  of  the  center  is  a  kind  of  “interest  operator” 
that  determines  the  position  where  the  feature 
is  located  in  a  region,  possibly  pre-defined. 

We  can  formulate  the  junction  detection  prob¬ 
lem  as  one  of  finding  the  parameter  values  that 
yield  a  junction  that  best  approximates  the  lo¬ 
cal  data  using  minimum  description,  and  declar¬ 
ing  local  minima  of  the  error  as  junctions.  The 
best-fit  parameter  values  provide  attributes  of 
the  detected  junction. 

Let $T$ denote the piecewise constant function/template.
It has $N$ angles and $N$ intensities
if $N$ is the number of constant pieces. Further,
let $I$ denote the input signal.

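Define the energy function at a point $(i, j)$ of
the image as follows:

$$E = \mathcal{V} + \lambda \mathcal{G},$$

where $\lambda > 0$.

The first term, $\mathcal{V}$, is a measure of the distance
of the fitted function from the data using the
$L^2$ norm:

$$\mathcal{V} = \int_0^{2\pi}\!\!\int_0^{\infty} [I(r, \theta) - T(\theta)]^2 \, g(r) \, r \, dr \, d\theta, \tag{7}$$

where $g(r)$ is an appropriate modulating function
that goes to zero for large $r$, thus defining
the template size.

The second term, $\mathcal{G}$, is a measure of the distance
of the gradient using the $L^2$ norm:

$$\mathcal{G} = \int_0^{2\pi}\!\!\int_0^{\infty} \bigl| \nabla [I(r, \theta) - T(\theta)] \bigr|^2 \, g^*(r) \, r \, dr \, d\theta, \tag{8}$$

where $g^*(r)$ is an appropriate modulating function,
not necessarily the same as $g(r)$.

Note that

$$\nabla I(r, \theta) = \frac{\partial I(r, \theta)}{\partial r} \, e_r + \frac{1}{r} \frac{\partial I(r, \theta)}{\partial \theta} \, e_\theta,$$

where $e_r$ and $e_\theta$ are the orthonormal vectors in
the $r$ and $\theta$ directions respectively, evaluated at
$(r, \theta)$. Since $T$ does not depend on $r$, we obtain

$$\mathcal{G} = \mathcal{A} + \mathcal{R},$$

where

$$\mathcal{A} = \int_0^{2\pi}\!\!\int_0^{\infty} \frac{1}{r^2} \left( \frac{\partial I}{\partial \theta} - \frac{\partial T}{\partial \theta} \right)^2 g^*(r) \, r \, dr \, d\theta, \tag{9}$$

$$\mathcal{R} = \int_0^{2\pi}\!\!\int_0^{\infty} \left( \frac{\partial I}{\partial r} \right)^2 g^*(r) \, r \, dr \, d\theta. \tag{10}$$

We have considered

$$
g(r) =
\begin{cases}
0, & r < R_0, \\
1, & R_0 \le r \le R_1, \\
0, & r > R_1,
\end{cases}
$$

and $g^*(r) = r^2 g(r)$. Our rationale for introducing a
“hole” of size $R_0$ is that (i) it works better than
without one, and (ii) optical effects are removed: at
the very center of the junction we expect
to find a blurred signal.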
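To make the data term concrete, here is a minimal sketch that evaluates a discrete version of the fit term (7) on a pixel grid. It assumes that a plain sum over pixels stands in for the integral (the pixel area plays the role of r dr dθ) and that g(r) is the 0/1 annulus given above. The names `template_value` and `data_term` and the synthetic test image are hypothetical.

```python
import math

def template_value(theta, angles, intensities):
    """Piecewise-constant template T(theta).

    Wedge k spans [angles[k], angles[k + 1]) in radians, and the last
    wedge wraps around past 2*pi back to angles[0].
    """
    n = len(angles)
    for k in range(n):
        lo, hi = angles[k], angles[(k + 1) % n]
        if lo <= hi:
            inside = lo <= theta < hi
        else:                                  # wedge wraps past 2*pi
            inside = theta >= lo or theta < hi
        if inside:
            return intensities[k]
    return intensities[-1]                     # single-wedge template

def data_term(image, cx, cy, angles, intensities, r0, r1):
    """Sum of squared differences between image and template on the annulus."""
    total = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            dx, dy = x - cx, y - cy
            r = math.hypot(dx, dy)
            if r0 <= r <= r1:                  # g(r) = 1 only on the annulus
                theta = math.atan2(dy, dx) % (2 * math.pi)
                diff = value - template_value(theta, angles, intensities)
                total += diff * diff
    return total

# Synthetic two-wedge image centered at (10, 10), built from the template.
true_angles = [0.0, math.pi]
true_levels = [100.0, 50.0]
image = [[template_value(math.atan2(y - 10, x - 10) % (2 * math.pi),
                         true_angles, true_levels)
          for x in range(21)] for y in range(21)]

exact_fit = data_term(image, 10, 10, true_angles, true_levels, 2, 8)
wrong_fit = data_term(image, 10, 10, true_angles, [50.0, 100.0], 2, 8)
```

The correct template gives a zero data term on this noise-free image, while swapping the two intensities gives a strictly positive one.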
Selecting the Scale

A user-defined threshold bounds $\lambda \mathcal{R}$; this defines
$R_1$. $R_0$ is often a user-defined fraction of
$R_1$: this allows a small hole in the center. A
significant observation from the series of experiments
has been the usefulness of a non-zero $R_0$.

To select the best location in an image region, $\lambda \mathcal{R}$
(with not necessarily the same $R_1$) is evaluated
at the points in the region. The point with the
minimum value defines the location. We illustrate
this in Figure 5.

Estimating wedge angles and
intensities

Let us assume the number of wedges is fixed.
We propose a dynamic programming formulation.
Let $A_1, A_2, \ldots, A_k$ denote the range
of admissible intensities and $\theta_1, \theta_2, \ldots, \theta_k$ the
discretized angles. Then dynamic programming
can minimize $E = \mathcal{V} + \lambda \mathcal{A}$ over the set
of pairs $(A_i, \theta_i)$ (for more details see
[17]).
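The dynamic programming idea can be sketched on a simplified problem. The version below is an assumption-laden stand-in, not the formulation of [17]: it fits a fixed number of constant wedges to a one-dimensional angular profile (one sample per discretized angle), minimizes only the squared-error data term rather than the full energy with the gradient term, and handles the circular wrap-around by brute force over every possible first boundary. The name `fit_wedges` and the 45-degree bin size are hypothetical.

```python
def fit_wedges(profile, n_wedges):
    """Best circular piecewise-constant fit with n_wedges pieces.

    profile: list of intensity samples, one per discretized angle.
    Returns (error, boundaries, intensities); boundaries are the bin
    indices at which each wedge starts.
    """
    n = len(profile)
    inf = float("inf")
    best = None
    for start in range(n):                     # try each bin as a boundary
        seq = profile[start:] + profile[:start]
        s1, s2 = [0.0], [0.0]                  # prefix sums of v and v*v
        for value in seq:
            s1.append(s1[-1] + value)
            s2.append(s2[-1] + value * value)

        def seg_cost(i, j):
            # Squared error of fitting seq[i:j] by its mean.
            total = s1[j] - s1[i]
            return (s2[j] - s2[i]) - total * total / (j - i)

        # cost[k][j]: least error covering seq[:j] with k pieces.
        cost = [[inf] * (n + 1) for _ in range(n_wedges + 1)]
        back = [[0] * (n + 1) for _ in range(n_wedges + 1)]
        cost[0][0] = 0.0
        for k in range(1, n_wedges + 1):
            for j in range(k, n + 1):
                for i in range(k - 1, j):
                    candidate = cost[k - 1][i] + seg_cost(i, j)
                    if candidate < cost[k][j]:
                        cost[k][j], back[k][j] = candidate, i
        if best is None or cost[n_wedges][n] < best[0]:
            cuts, j = [], n
            for k in range(n_wedges, 0, -1):   # trace the pieces back
                i = back[k][j]
                cuts.append((i, j))
                j = i
            cuts.reverse()
            boundaries = [(start + i) % n for i, _ in cuts]
            levels = [(s1[j] - s1[i]) / (j - i) for i, j in cuts]
            best = (cost[n_wedges][n], boundaries, levels)
    return best

# Eight 45-degree bins; wedges start at 90, 180 and 315 degrees.
profile = [40, 40, 120, 120, 200, 200, 200, 40]
error, boundaries, levels = fit_wedges(profile, 3)
```

The example profile mirrors the 3-corner used in the stability test (boundaries at 90°, 180° and 315° with intensities 120, 200 and 40), so the fit recovers boundaries at bins 2, 4 and 7 with zero error.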

Estimating the number of wedges: The (optimal)
number of wedges, $N$, is computed by
thresholding the relative error, $r^n$.

Although, in principle, we are looking for the
minimum $r^n$, in practice we terminate the computation
when $r^n$ drops below a pre-defined
threshold as $n$ is increased. Note that as the
number of parameters increases, the error decreases,
i.e., $r^n < 1$. A variational form that may justify
this approach is to use, as suggested in the book
of Morel and Solimini [15], a split and merge algorithm
for the energy $E = \log(r^n)$.

To  summarize,  the  following  steps  are  involved 
in  detecting  junctions  on  a  large  region  of  an 
image: 

• Compute $\mathcal{R}$, the measure of radial variation,
and $R_1$, the size of the template, at every
point.

• Filter the locations using a threshold on $\mathcal{R}$.

• In a neighborhood of a filtered location,
pick the one with minimum $\mathcal{R}$, and remove all
other locations within a radius of $R_1$ of it.
Repeat this for all the filtered locations.

• Compute the junction parameters for all
the filtered locations.
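The filtering and neighborhood-minimum steps above can be sketched as a greedy selection. The function name `select_junction_centers` is hypothetical, and two choices are assumptions rather than facts from the text: locations are kept when their radial-variation value lies below the threshold, and the threshold (0.5) and template radius (3 pixels) used in the example are invented. The radial-variation values themselves are the ones listed for the neighborhood in Figure 5.

```python
def select_junction_centers(radial_variation, template_radius, threshold):
    """Pick junction centers from per-pixel radial-variation values.

    radial_variation: dict mapping (x, y) to the radial-variation value
    template_radius:  dict mapping (x, y) to the template size R1 there
    threshold:        locations with a value at or above it are dropped
    """
    # Visit the surviving locations from smallest value upward, so that
    # each kept location is the minimum of its own neighborhood.
    candidates = sorted(
        (value, location)
        for location, value in radial_variation.items()
        if value < threshold)
    kept = []
    for _, (x, y) in candidates:
        suppressed = any(
            (x - kx) ** 2 + (y - ky) ** 2 <= template_radius[(kx, ky)] ** 2
            for kx, ky in kept)
        if not suppressed:
            kept.append((x, y))
    return kept

# Values reported for the neighborhood shown in Figure 5.
values = {
    (38, 30): 0.61, (40, 30): 0.49,
    (38, 31): 0.57, (39, 31): 0.36, (40, 31): 0.47,
    (38, 32): 0.47, (39, 32): 0.38, (40, 32): 0.44,
}
radii = {location: 3 for location in values}   # assumed template size
centers = select_junction_centers(values, radii, threshold=0.5)
```

On these values only location (39, 31) survives, in agreement with the junction location reported in Figure 5. The junction parameters would then be estimated at each surviving center.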

4.1  Results  of  Experiments 

We carried out a series of experiments on synthetically
generated images (see Figure 4) to
test the stability of the algorithm in the presence
of noise in the image, and to further understand
the role of smoothing. In these experiments,
we suppressed the use of threshold
values by using a fixed $R_1$ and fixed $N = 3$, i.e.,
forcing the algorithm to pick the optimal 3-corners.
We then carried out experiments on
real images to test the general performance of
the algorithm.


Stability of the Algorithm: Figure 4 shows
the result of one such experiment where we look
for 3-corners in the center region of the image.
It shows that when the image edges are sharp
($\sigma = 0 \ldots 3$ in the figure), the errors in the angles
and the intensities are slightly higher. The
errors are least in the range $\sigma = 6 \ldots 20$, and
increase beyond that.

Note  that  the  (Manhattan)  distance  of  the  lo¬ 
cation  of  the  3-corner  is  less  than  6  pixels;  the 
error  in  the  angles  is  bounded  by  30°  at  the  very 
worst  and  the  intensity  differs  by  40  units  in  the 
worst  situation.  It  also  shows  that  at  the  best 
situation  the  angle  differs,  on  an  average,  by  15° 
from  the  true  answer. 

When the images used are smoothed versions
of the ones shown in Figure 4, the error is
reduced for the sharp images, that is, for low
values of $\sigma$.

It is worth noting that even for sharp junctions
(without noise added) the algorithm performs
better when smoothing is introduced. This is
because of the natural discretization of the image
and the numerical scheme used to integrate
$I(r, \theta)$ over $r$. This bias is clearly reduced when
homogeneous smoothing is introduced prior to
the numerical integration.

Figure 6 shows the results of the detector on regions
of images. After the filtering, it computes
the parameters for about four junctions per
minute on a Sun SPARCstation.

5  Articulated  Object  Recognition 

Our  approach  is  a  Bayesian  integration  of  shape 
similarity  and  snakes,  and  naturally  combines 
top-down  and  bottom-up  algorithms.  The 
bottom-up method extracts edges, then constructs
snakes (or contours) by grouping edge
elements, and feeds the shape analysis. The
top-down one uses shape analysis, by comparing
the object model with the extracted snakes, to
guide/prune the search for other snakes. The
optimizations are based on Dijkstra's algorithm,
and further pruning of this algorithm is obtained
by working on object parts separately.
Our  approach  is  general  enough  to  handle  three 
dimensional  objects,  but  our  focus  here  is  on 
two  dimensional  contours. 
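As a minimal sketch of how Dijkstra's algorithm can trace a contour between two points, the following finds the cheapest 4-connected pixel path on a cost grid. The cost used here, one over edge strength plus a constant, is an assumption chosen for illustration; the actual snake and shape costs of this work are not reproduced. The name `cheapest_contour` and the toy edge map are hypothetical.

```python
import heapq

def cheapest_contour(cost, start, goal):
    """Dijkstra's algorithm on a pixel grid.

    cost[y][x] is the price of stepping onto pixel (x, y). Returns the
    total cost and the list of (x, y) pixels from start to goal.
    """
    height, width = len(cost), len(cost[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, (x, y) = heapq.heappop(heap)
        if (x, y) == goal:
            break
        if d > dist[(x, y)]:
            continue                           # stale queue entry
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                nd = d + cost[ny][nx]
                if nd < dist.get((nx, ny), float("inf")):
                    dist[(nx, ny)] = nd
                    prev[(nx, ny)] = (x, y)
                    heapq.heappush(heap, (nd, (nx, ny)))
    path = [goal]
    while path[-1] != start:                   # walk back to the start
        path.append(prev[path[-1]])
    path.reverse()
    return dist[goal], path

# Toy edge-strength map: the strong edge (9) forms an L-shaped contour.
edge = [[9, 9, 9, 0],
        [0, 0, 9, 0],
        [0, 0, 9, 9]]
step_cost = [[1.0 / (e + 1.0) for e in row] for row in edge]
total, path = cheapest_contour(step_cost, start=(0, 0), goal=(3, 2))
```

The path follows the strong edge pixels because stepping onto a weak pixel costs ten times as much in this toy cost.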

The  Model 




We  are  given  an  image  I  and  first  establish  the 
difference  and  similarity  between  the  processes 
of  recognizing  and  of  finding  objects  in  images. 

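In order to recognize a contour model $\Gamma^S$ in the
image, we construct

$$
\begin{aligned}
P(\Gamma^S \mid I)
&= \sum_{\Gamma^T, \{t(s)\}} P(\Gamma^S, \Gamma^T, \{t(s)\} \mid I) \\
&= \sum_{\Gamma^T, \{t(s)\}} P(\Gamma^S, \{t(s)\} \mid \Gamma^T, I) \, P(\Gamma^T \mid I),
\end{aligned} \tag{11}
$$

where the sum is over all possible image
contours and all possible correspondences,
$\{t(s)\}$, between the image contour $\Gamma^T$,
parametrized by $t$, and the model contour $\Gamma^S$,
parametrized by $s$. Note that while we may not
know $P(\Gamma^S \mid I)$, we can create simple generative
models for contours to obtain $P(\Gamma^T \mid I)$.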

The problem of finding contours in images can
be thought of as that of finding the contour
$\Gamma^{T*}$ in the image that contributes most to the
sum above, i.e., to find

Figure 4: Test of stability: 8-bit images with
Gaussian noise (horizontal axis: noise in image).
The first image in the top left has standard
deviation $\sigma = 0$ (that is, no noise), and
the angles of the 3-corner at the
center, $(x, y)$, are $(\alpha_{1f}, \alpha_{2f}, \alpha_{3f}) =
(90°, 180°, 315°)$, with intensities
$(i_{1f}, i_{2f}, i_{3f}) = (120, 200, 40)$. Noise
($\sigma$) is varied from 0 to 69 to obtain 24
images, three of which are shown in
the top row for illustration. The following
errors are computed: location,
$errL_\sigma = |x_\sigma - x| + |y_\sigma - y|$; intensities,
$(errI_\sigma)^2 = \sum_j (i_{j\sigma} - i_{jf})^2 / 3$; angles,
$(errA_\sigma)^2 = \sum_j (\alpha_{j\sigma} - \alpha_{jf})^2 / 3$; and
plotted vs. $\sigma$. These images are then
Gaussian smoothed, with increasing
factor as the image gets noisier. The
first eight images are smoothed using
$\sigma_s = 2.0$, the next eight use $\sigma_s = 3.0$
and the last set uses $\sigma_s = 4.0$. We plot
the errors as in the previous case. As
expected, we are able to get rid of the
high errors for the very sharp images
(low values of $\sigma$).
Figure 5 panels: input image with $L_{39,31}$ marked; region around
$L_{39,31}$; the junction at location $L_{39,31}$. Values of $\mathcal{R}$ in
the neighborhood:

$L_{38,30}$: $\mathcal{R} = 0.61$
$L_{40,30}$: $\mathcal{R} = 0.49$
$L_{38,31}$: $\mathcal{R} = 0.57$
$L_{39,31}$: $\mathcal{R} = 0.36$
$L_{40,31}$: $\mathcal{R} = 0.47$
$L_{38,32}$: $\mathcal{R} = 0.47$
$L_{39,32}$: $\mathcal{R} = 0.38$
$L_{40,32}$: $\mathcal{R} = 0.44$

Figure 5: The use of $\mathcal{R}$ to locate the center
of the junction. $L_{x,y}$ indicates
the $x$ and $y$ coordinates on the image.
Note that the location $L_{39,31}$
has the minimum $\mathcal{R}$ in the neighborhood.
Incidentally, $R_1$, the size
of the template, is the same for all
the nine locations.


(b)  Junctions.  (c)  Junction  templates  of  (b). 


The focus of our work is then to compute the
optimal $(\Gamma^{T*}, \{t^*(s)\})$.

Key  points  and  features 

We introduce the possibility of detecting localized
features in the image represented by a set
$\{y_k : k = 1, \ldots, K\}$. They can represent corners,
junctions, high curvature points, etc. Since our
models are contour models, we use an ordered
set of model features $[x_s : s = 1, \ldots, S]$ representing
possibly high curvature points, or bending
points.

The  key  points  of  a  model  contour  are  points 
that  convey  a  lot/most  of  information  about  the 
contour  (see  Figure-7(a)).  These  points  con¬ 
nected  by  straight  lines  can  give  a  good  cari¬ 
cature  of  the  contour,  even  when  this  contour 
undergoes  deformations/articulations.  All  the 
points  of  a  contour  figure  that  are  allowed  to 
bend  (articulation  points)  are  included.  It  is 
beyond  the  scope  of  this  paper  to  derive/detect 
these  points  from  an  information-theoretic  mod¬ 
eling.  We  will  simply  manually  select  them  for 
each  contour  model. 

Note that the key points of the model can, in
the limiting case, be all of the contour points. The
distance between two points in the model, $s$ and
$r$, is given by $d_S(s, r)$ and can be precomputed
by following the contour. We then decompose
$P(\Gamma^T \mid \Gamma^S, I)$ as

$$
\begin{aligned}
P(\Gamma^T \mid \Gamma^S, I)
&= \sum_{[x_s], \{y_k\}} P(\Gamma^T, [x_s], \{y_k\} \mid \Gamma^S, I) \\
&= \sum_{[x_s], \{y_k\}} P(\Gamma^T \mid [x_s], \{y_k\}, \Gamma^S, I) \, P([x_s], \{y_k\} \mid \Gamma^S, I) \\
&\approx \sum_{[x_s], \{y_k\}} P(\Gamma^T \mid [x_s], \{y_k\}, \Gamma^S, I) \, P(\{y_k\} \mid [x_s], \Gamma^S, I) \, P^{\text{model-points}}([x_s] \mid \Gamma^S),
\end{aligned} \tag{13}
$$

where in the approximation we have taken
the model points, $[x_s]$, to be independent
of the image $I$, i.e., $P^{\text{model-points}}([x_s] \mid \Gamma^S) =
P^{\text{model-points}}([x_s] \mid \Gamma^S, I)$. $P(\Gamma^T \mid [x_s], \{y_k\}, \Gamma^S, I)$
is decomposed into a shape model,
$P^{\text{shape}}(\Gamma^T \mid [x_s], \{y_k\}, \Gamma^S)$, and a snake model,
$P^{\text{snake}}(\Gamma^T \mid I)$. We say the whole process is
a deconstruction because we not only decompose
the recognition into various terms, but later we
reconstruct the final object in the image through
this decomposition.

Figure 6: Image results. Low contrast junctions
have not been filtered.

(r^Vy)

\WI\Ht) + e

+ Ai (t) + A2

dt,

and

Ideally, the image features $\{y_k\}$ originate from
the model points $[x_s]$. A model of the geometric
projection as well as illumination and reflectance is
necessary. However, in our experiments we have neglected
these issues and simply used the “SUSAN”
feature detector [21] to generate a set of key features
in the image. They are candidates for matching the
model key features. These features usually capture
corner points, junction points, and salient edges, and
we have found them useful for our experiments (see
Figure 7(b)). Clearly, the total number of detected
features affects the algorithm complexity, and we do
require a reasonable number of them to be capable of
detecting the target. Roughly speaking, the feature
resolution should be fine enough to sample a
distribution similar to that of the model key
features.

Matching key points and features: In the
above model we have not made explicit that the
model points, $[x_s]$, have a correspondence in the
feature set $\{y_k\}$. We can do so by introducing
binary matching variables $\{M_{s,k} : s = 1, \ldots, S;\ k =
1, \ldots, K\}$, so that

$$
M_{s,k} =
\begin{cases}
1, & \text{when } y_k \text{ is matched to } x_s, \\
0, & \text{otherwise.}
\end{cases}
$$

• $\sum_{s=1}^{S} M_{s,k} = 0$ or $1$. 0 occurs when $y_k$ is
not part of the model contour.

• $\sum_{k=1}^{K} M_{s,k} = 1$.

In this paper, we have not experimented with occlusions;
otherwise we would need to consider the case
$\sum_{k=1}^{K} M_{s,k} = 0$.
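The two constraints on the matching variables can be checked mechanically. The sketch below assumes the matrix is stored as a list of S rows (model points) by K columns (image features); the name `is_valid_matching` is hypothetical, and the occlusion case, in which a model point has no match, is deliberately rejected here because it is excluded in the text.

```python
def is_valid_matching(M):
    """Check the constraints on binary matching variables M[s][k].

    M[s][k] is 1 when image feature y_k is matched to model point x_s.
    Every model point must be matched exactly once (no occlusion), and
    every image feature may be used at most once.
    """
    if any(value not in (0, 1) for row in M for value in row):
        return False
    rows_ok = all(sum(row) == 1 for row in M)          # sum over k equals 1
    cols_ok = all(sum(col) <= 1 for col in zip(*M))    # sum over s is 0 or 1
    return rows_ok and cols_ok

# Two model points, three image features; feature 1 is not on the contour.
good = [[1, 0, 0],
        [0, 0, 1]]
shared = [[1, 0, 0],
          [1, 0, 0]]        # both model points claim feature 0
occluded = [[0, 0, 0],
            [0, 1, 0]]      # model point 0 has no match
```

Here `good` satisfies both constraints, while `shared` and `occluded` each violate one of them.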


psh  [es(g)-eT(f)F  ,  , 

[\esis)\  +  |0T(f)IF-'  +  iF-i  • 

We  now  give  an  intuitive  account,  the  rational,  of  the 
top-down  k  bottom-up  processes.  Following  that  we 
derive  this  rational  from  the  optimization  method, 
Dijkstra  algorithm,  to  compute  the  solution. 

The rationale: From the top-down view of the
problem, we need to select the optimal choice of $M$,
i.e., the optimal matching correspondence between
model key points and image key features. To be efficient,
the algorithm must not search all possible configurations
of $M$. The guidance of the top-down
model is to decide which pairs, say $M_{s-1,j} = 1$ and
$M_{s,i} = 1$, to consider, without searching all possible
ones.

The top-down module hypothesizes a pair of correspondences,
say $M_{s-1,j} = 1$ and $M_{s,i} = 1$, and feeds
the model contour $\Gamma^{s}_{s-1}$ to the bottom-up module.

The bottom-up process tests the hypothesis that
$(x_{s-1}, x_s)$ corresponds to $(y_j, y_i)$, i.e., $M_{s-1,j} = 1$
and $M_{s,i} = 1$, taking the contour part $\Gamma^{s}_{s-1}$ as input.
There is freedom to select different contours
connecting $y_j$ to $y_i$. The bottom-up process
returns the cost induced by the best contour choice.
This cost takes into account the snake cost
and the shape comparison cost with $\Gamma^{s}_{s-1}$.

The  top-down  will  then  decide  which  other  pair  of 
correspondences  to  hypothesize  for  its  search  (see 
more  details  in  [8]).  As  a  motivation  to  read  our 
article  we  show  the  results  of  the  algorithm. 
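
The interplay described above can be sketched as a shortest-path search in which the top-down side proposes the next pair of correspondences and a bottom-up callback prices it. This is an illustrative sketch only, not the authors' implementation: the names `best_correspondence` and `segment_cost` are hypothetical, costs are assumed non-negative, and as a simplification only consecutive model points are prevented from sharing an image feature.

```python
import heapq


def best_correspondence(num_model_points, num_features, segment_cost):
    """Dijkstra-style search over states (s, k): model point s matched to feature k.

    segment_cost(s, j, i) is a hypothetical bottom-up callback returning the
    non-negative cost of matching model points (s-1, s) to features (j, i).
    It is evaluated only for pairs the search actually expands.
    """
    # Any feature may start the contour at zero cost.
    heap = [(0.0, 0, k, (k,)) for k in range(num_features)]
    heapq.heapify(heap)
    settled = set()

    while heap:
        cost, s, k, path = heapq.heappop(heap)
        if (s, k) in settled:
            continue
        settled.add((s, k))
        if s == num_model_points - 1:
            return cost, list(path)  # cheapest complete assignment
        for i in range(num_features):
            if i == k or (s + 1, i) in settled:
                continue
            step = segment_cost(s + 1, k, i)  # bottom-up evaluation
            heapq.heappush(heap, (cost + step, s + 1, i, path + (i,)))
    return None
```

Because states are expanded in order of accumulated cost, many hypotheses are never priced at all, which is the sense in which the top-down model avoids searching every configuration of $M$.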

Abstract

Innovations in Polarization Vision Technology made
over the past several years have enabled the recent preliminary
demonstration of unique capabilities for battlefield
awareness provided by polarization sensing. Just
underway is a comprehensive effort in polarization-based
Automatic Target Recognition and Detection
(ATR/D) encompassing both scientific and empirical aspects.
Overviewed are the objectives, research issues,
and methods of performance evaluation for this new effort.

1  Motivation 

The main concept motivating this effort is that passively
sensed characteristics of polarization of reflected,
scattered, and thermally emitted light can be predictably
related to key physical characteristics of man-made
objects (e.g., military target vehicles, camouflage
netting, etc.) and to physical characteristics of natural
background and terrain. This is leading to the systematic
development of Automatic Target Recognition and Detection
(ATR/D) image understanding algorithms with
unique capabilities that can be of great importance to
battlefield awareness. Polarization-based image understanding
provides unique advantages in areas that have
been especially problematic using other existing technologies,
including: i) ATR/D under heavy occlusion
and heavy clutter, ii) detection of camouflage netting,
and detection of the presence of target vehicles under such
netting, and iii) distinguishing actual targets from false
decoys. In addition, polarization cues provide important
augmented physical features for target recognition and
target pose determination. The following summarizes
some key points:

•  Polarization  being  physically  orthogonal  to  intensity 
and  color  is  invariant  to  extrinsic  appearance,  either 
modified  by  camouflage  and/or  degraded  by  clutter. 

*Past research funding that has contributed to this effort includes
DARPA grants DAAH04-94-G-0278, DAAH04-96-1-0216, DARPA
contract F30602-92-C-0191, AFOSR grant F49620-93-1-0484, and
NSF National Young Investigator Award IRI-9357757.


•  Reflected  and  thermally  emitted  polarization  is  di¬ 
rectly  related  to  intrinsic  properties  of  a  target 
including  material  composition,  surface  roughness, 
and  3D  shape. 

•  Polarization  sensing  has  significant  advantages  over 
appearance  based  ATR/D  methods  not  directly 
linked  to  physical  characteristics  of  targets. 

A major challenge to ATR/D image understanding is
the unpredictable non-repeatability of features due to signature
variations of targets in different physical states,
further confounded by the coincidental appearance of cluttered
scene elements and the intended deceptive appearance
of decoys. The use of polarization in this effort predictively
links distinctive physical signatures to man-made
objects, dependent upon physical characteristics such as
local orientation, local change in orientation (i.e., local
“bending” or curvature), material roughness, and material
composition, while being invariant to brightness and
color appearance states. With validation from supporting
reflectance models, reliable levels of repeatable algorithm
performance can be achieved by predictively linking
signatures to physical characteristics that distinguish
man-made objects from natural background (including
clutter) for target detection, and that distinguish between
man-made objects for target recognition and identification
of decoys. Furthermore, polarization signatures
can be used to solve the inverse problem of estimating some
aspects of the state of a target vehicle, such as its relative
pose. The innovation of polarization camera technology,
developed by the principal investigator under previous
DARPA efforts, now makes it possible to reliably acquire
large data collections of empirical polarization imagery
for algorithm development, test and performance evaluation,
and direct comparison with predictive reflectance
modeling.

Successful completion and demonstration of this effort
could significantly impact the state-of-the-art in ATR/D
with the development of a new sensory class of image
understanding algorithms that transcend conventional
methods of disguising target assets using camouflage appearance
methods such as paint, netting, and cluttering
of background. Furthermore, countermeasures to
defeat polarization-based ATR/D image understanding
algorithms will require a serious rethinking by military
contractors about the design and construction of camouflage
netting and target decoys, what intrinsic material
properties the paints that cover target vehicles will have
to possess, and even possible reconsideration of the exterior
geometric characteristics of military vehicles themselves.
While the basic research aspects of this effort in
terms of predictive polarization reflectance modeling are
primarily intended for the development and validation of
polarization-based ATR/D algorithms, they will in fact
simultaneously provide significant insight into just what
countermeasures are needed to defeat such algorithms.

2  Objectives 

The  highest  level  objective  of  this  effort  is  to  demon¬ 
strate  a  definitive  proof  of  concept  of  polarization  vision 
as  a  new  innovative  technology  having  significant  impor¬ 
tance  for  Automatic  Target  Recognition  and  Detection 
for  battlefield  awareness. 

2.1  Scope 

This  is  a  comprehensive  research  effort  spanning  a  di¬ 
verse  range  of  tasks  for  the  development  and  demonstra¬ 
tion  of  polarization-based  technology  addressing  some 
very  challenging  problems  in  Automatic  Target  Recogni¬ 
tion  and  Detection  (ATR/D).  The  scope  of  this  effort  in¬ 
volves  extensive  data  collection,  extension  of  polarization 
camera  sensor  design  into  the  IR,  physical  reflectance  and 
outdoor  illumination  modeling,  rendering  of  predicted 
polarization  signatures  from  physical  models,  algorithm 
development,  and  performance  verification.  The  follow¬ 
ing  summarizes  some  of  the  major  aspects: 

• Empirical collection of polarization imagery on various
known domestic and foreign targets using visible,
near-IR (night vision), and mid-IR (Forward
Looking IR imaging) polarization sensing technologies.

• Accurate physical modeling of incident sky illumination
and of polarization reflectance from targets,
camouflage, and terrain.

• Predictive rendering of characteristic target signatures
using detailed BRL-CAD models of military
target vehicles.

• Algorithm development, demonstration, and performance
evaluation using collected image data, guided
by the science of reflectance modeling and predictively
rendered signatures.

Empirical imaging of ground-based targets by ground-based
polarization camera sensors will take place on test
ranges at the Aberdeen Proving Ground and at Fort A.P.
Hill. Therefore the environment in which targets will be
embedded is primarily woodland terrain. Targets to be
imaged will include both domestic and foreign assets, including
the United States M-1 Tank, M-3 Bradley Fighting
Vehicle, and HMMWV (High Mobility Multipurpose Wheeled
Vehicle, of National Guard variety), and the Russian BMP-1
Infantry Fighting Vehicle, BTR70 Personnel Carrier, and
T72 Tank.

Three  (3)  basic  types  of  polarization  camera  sensors 
will  be  used  in  three  respective  spectral  ranges.  These 
spectral  ranges  are  the  visible  range  400-700nm,  the  near 
infrared  range  700-900nm  in  which  night  vision  photo¬ 
multiplier  technology  operates,  and,  the  mid-infrared 
ranges  3-5  micron  and  8-12  micron.  An  important  ob¬ 
jective  is  to  extend  current  polarization  camera  design 
that  has  worked  so  well  in  the  visible  spectrum,  to  the 
near-IR  and  mid-IR  spectrums. 

2.2  Some  Details 

Wavelengths in the visible and near to mid IR
spectrums are significantly smaller than any discernible
shape detail on military targets, and therefore polarization
information provided at these wavelengths has very
high spatial resolution. It is this characteristic that
makes such polarization-based ATR/D algorithms advantageous
for detecting and recognizing partially occluded
targets, even possibly under very high occlusion.
All that is required for detection/recognition is the line-of-sight
visibility of a portion of a target with a discernible
material and/or shape characteristic identified
by a measured polarization signature, or set of signatures.
This also applies to netted camouflage material and
targets concealed by such camouflage. The following is
a list of some of the related detailed objectives for this
effort:

• Optimize detection and recognition performance of
vehicle targets under various levels of occlusion by
objects in woodland terrain, up to 90% occlusion
(day and night operation).

• Optimize detection of camouflage nets in woodland
terrain (day and night operation).

• For detected camouflage nets, optimize accuracy for
detecting presence of a concealed vehicle (day and
night operation).

• Optimize discrimination ability between actual targets
and false decoys (day and night operation).

•  Validate  target  and  background  polarization  signa¬ 
tures  with  predictive  reflectance  modeling. 

3  Research  Issues 

In the same vein as previous research efforts in polarization
vision, the basis for algorithm development
in polarization-based Automatic Target Recognition and
Detection will have a strong grounding in the basic science
of physics. The inherent philosophy for algorithm
design is that polarization signatures used in algorithm
development must be fundamentally predictable using
physical models; otherwise robustness and repeatability
for detection and/or recognition are tenuous.

3.1  Understanding  Reflected  and  Thermally  Emitted  Polarization

Recent work in the visible spectrum [6] has shown a
definitive correspondence between predicted reflected polarization
signatures of vehicle targets from BRL-CAD
models and empirical polarization imagery. Incorporated
into renderings of polarization signatures are reflectance
models as well as outdoor illumination polarization
models. Clear skylight is not unpolarized: sunlight
incident on the atmosphere is polarized by scattering,
the partial polarization dependent upon scattering angle
according to Rayleigh’s Law [1]. Further understanding
is required for polarization signatures under a variety of
outdoor weather conditions, and this will require more accurate
polarization illumination modeling respective to
cloud cover and humidity/temperature conditions. One
objective for algorithm development is to extend geometric
flexible template matching for pose estimation (e.g.,
[2]) to include polarization radiometric features for both
target recognition and pose estimation using such predictive
modeling. A significant technical challenge is the
ability to do this with as approximate knowledge of illumination
conditions as possible. It is anticipated that
predictive modeling of polarization signatures at night
(i.e., using night vision photo-multiplication) will be considerably
simpler, as starlight and moonlight produce
very little atmospheric scattering and hence such light is
unpolarized. Understanding polarization signatures from
object vehicles at night in the 700-900nm range will be
a significant research issue.
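
As a worked illustration of the dependence of skylight polarization on scattering angle, the sketch below evaluates the textbook expression for ideal single Rayleigh scattering of unpolarized sunlight, $P(\theta) = \sin^2\theta / (1 + \cos^2\theta)$. This idealized formula is an assumption made here for illustration; it ignores multiple scattering, aerosols, and cloud cover, and is not the illumination model used in this effort. The function name is hypothetical.

```python
import math


def rayleigh_partial_polarization(scattering_angle_deg):
    """Partial polarization of singly Rayleigh-scattered, initially unpolarized light.

    The scattering angle is measured between the incident sunlight direction
    and the scattered (viewing) direction.
    """
    cos_sq = math.cos(math.radians(scattering_angle_deg)) ** 2
    return (1.0 - cos_sq) / (1.0 + cos_sq)


if __name__ == "__main__":
    for angle in (0, 30, 60, 90):
        print(angle, round(rayleigh_partial_polarization(angle), 3))
```

Under this idealization the partial polarization vanishes looking toward or away from the sun and peaks at a scattering angle of 90 degrees.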

A major research issue in this effort involves understanding
the polarization signatures respective to IR radiation,
particularly in the mid-infrared 3-5 micron and
8-12 micron ranges (day and night operation). Polarization in
the mid-infrared (i.e., FLIR) for man-made objects offers
even more signatures for enhanced ATR/D. While
similar polarization effects occur for reflected infrared
radiation, additional polarization phenomena occur for
thermal radiation from heated objects. A model originally
proposed for polarization of the diffuse component
from objects in the visible spectrum using Fresnel theory
[7] can be used to fundamentally predict thermally emitted
polarization signatures. Part of this research effort
is to determine what modifications and additions to this
model are required for more precise prediction. Figure 1
shows a simulation of predicted partial polarization from
a thermally emitting cylinder at 400 K predicted by the
model in [7], revealing that significant partial polarization
occurs where the local surface normal is very oblique relative
to viewing (i.e., more than 60°). The implications
of this are important. By highlighting and providing
easier segmentation of occluding and self-occluding contours
of heated portions of a vehicle (e.g., engine parts),
recognition and detection of vehicles for night and day
operation is significantly enhanced. This also provides a
means for distinguishing a 3D thermal signature from a
2D thermal signature such as from electric blankets used
to deceptively represent thermal emissions from engines
on decoys for targets.

Figure 1: Two views of a simulated thermally emitting
cylinder at 400 K. The top images are a standard
shaded rendering of the cylinder to illustrate its pose.
The bottom images render partial polarization from thermal
emission as a grey value. Zero partial polarization
is black. The brightest grey value in the bottom images
corresponds to partial polarization of just above 40%.
The cylinder on the left is tilted 45° forward from upright
and the cylinder on the right is tilted 7° forward
from upright. Note also how partial polarization becomes
visible on the top surface as it becomes oblique.
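
The qualitative trend described above, with emission polarization growing as the surface normal becomes oblique to the viewer, can be reproduced with a simple Fresnel calculation. The sketch below is not the model of [7]; it assumes a smooth, opaque dielectric with a real refractive index of at least 1, takes the directional emissivity of each polarization component as one minus the corresponding Fresnel reflectance, and uses a hypothetical function name and default index.

```python
import math


def emission_partial_polarization(view_angle_deg, refractive_index=1.5):
    """Partial polarization of thermal emission from a smooth dielectric surface.

    view_angle_deg is the angle between the surface normal and the viewing
    direction. Assumes emissivity = 1 - Fresnel reflectance for each of the
    perpendicular (s) and parallel (p) polarization components.
    """
    if not 0.0 <= view_angle_deg < 90.0:
        raise ValueError("view angle must be in [0, 90) degrees")
    n = refractive_index
    theta = math.radians(view_angle_deg)
    cos_i = math.cos(theta)
    sin_t = math.sin(theta) / n            # Snell's law, air to material
    cos_t = math.sqrt(1.0 - sin_t * sin_t)

    # Fresnel power reflectances for the two polarization components.
    r_s = ((cos_i - n * cos_t) / (cos_i + n * cos_t)) ** 2
    r_p = ((n * cos_i - cos_t) / (n * cos_i + cos_t)) ** 2

    e_s = 1.0 - r_s
    e_p = 1.0 - r_p
    return (e_p - e_s) / (e_p + e_s)


if __name__ == "__main__":
    for angle in (0, 30, 60, 75, 85):
        print(angle, round(emission_partial_polarization(angle), 3))
```

For the assumed index of 1.5 the value stays below ten percent out to 60 degrees and rises steeply beyond that, consistent in trend with the simulation discussed above, although the absolute numbers depend on the material assumptions.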

Figure 2 shows an empirical polarization image (right)
of a side view of a truck with its engine running, using a 3-5
micron FLIR with polarizers installed in a filter wheel.
This image was taken at the Lockheed-Martin facility in
Denver, Colorado. The left image shows a standard IR
image, while the right image of Figure 2 shows bright
pixels superimposed on the darkened left image denoting
where partial polarization from thermal emission is above
5%. This empirically demonstrates the same qualitative
effect of high partial polarization present at surface normals
significantly oblique relative to viewing, anticipated
by the simulations shown in Figure 1. The next step is
quantitative comparison/verification from empirical IR
polarization images of known ground-truth thermally radiating
objects.

Figure 2: Two 3-5 micron FLIR images taken of the side of a truck with engine running, at Lockheed-Martin,
Denver. The left image is a standard FLIR image. The right image is a polarization FLIR image which shows partial
polarization greater than 5% in bright white superimposed on the left image (the left image is darkened for clarity of partial
polarization depiction).


3.2  Extending  polarization  sensors  into  the  IR

Under past DARPA efforts the Computer Vision Laboratory
at Johns Hopkins has developed a polarization
camera design for the broad visible spectrum using two
twisted nematic (TN) liquid crystals and a fixed polarizer
modularly fitted onto the lens of a CCD camera.
The idea behind a liquid crystal polarization camera is
very simple, which is why it works so well. Nothing mechanically
rotates; the polarizer remains fixed while the
TN liquid crystals electro-optically rotate
the plane of the linear polarized component of reflected
partially linear polarized light in synchronization
with the video rate of the camera. This is described in
detail in [3, 4, 5]. A set of research issues to be solved
early on in this effort concerns extending this design concept to
sensors in the near-IR and mid-IR. One issue, primary for
adaptation to night vision technology, is the redesign
of liquid crystals to accommodate the IR spectrum.
A second issue, primary for adaptation
to FLIR, is redesigning the electronics for the liquid
crystals so they do not interfere with the existing system.
For both near-IR and FLIR systems, measurement
calibration is another important issue.
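
For readers unfamiliar with how partial linear polarization is recovered from intensity measurements, the sketch below shows one common reduction from three effective polarizer orientations (0, 45, and 90 degrees) to the linear Stokes components. It is a generic illustration under an ideal-polarizer assumption; the orientations, the function name, and the procedure itself are assumptions here and are not claimed to be what the camera described in [3, 4, 5] does.

```python
import math


def linear_polarization_state(i0, i45, i90):
    """Return (total intensity, partial polarization, polarization angle in degrees).

    i0, i45, i90 are intensities measured through an ideal linear polarizer
    at effective orientations of 0, 45, and 90 degrees.
    """
    s0 = i0 + i90              # total intensity
    s1 = i0 - i90              # preference for 0 over 90 degrees
    s2 = 2.0 * i45 - s0        # preference for 45 over 135 degrees
    if s0 == 0:
        return 0.0, 0.0, 0.0
    partial = math.hypot(s1, s2) / s0
    angle_deg = 0.5 * math.degrees(math.atan2(s2, s1))
    return s0, partial, angle_deg
```

Applied per pixel to three registered images, this yields partial-polarization and orientation images; measurement calibration, noted above as an open issue for the IR sensors, determines how well the ideal-polarizer assumption holds.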

4  Evaluation  of  Performance  and 
Progress 

The methodology for ATR/D algorithm performance
evaluation will be to compute Receiver Operating Characteristic
(ROC) curves for polarization image data subsets
taken under specifically controlled conditions. The ROC
curves of interest to us will plot probability of target
detection/recognition versus the number of false alarms
for a given algorithm. A well-performing algorithm will
have a high probability of detection with low false alarm
rates.
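
A minimal sketch of how such a curve could be tabulated from algorithm outputs follows. It assumes, hypothetically, that an algorithm assigns each candidate region a confidence score and that ground truth marks each candidate as target or non-target; the function name and input layout are illustrative only.

```python
def roc_points(scores, is_target):
    """Return (false_alarm_count, detection_probability) pairs, one per threshold.

    scores[i] is the detector confidence for candidate i and is_target[i]
    is its ground-truth label. Thresholds sweep from strictest to loosest.
    """
    num_targets = sum(1 for t in is_target if t)
    if num_targets == 0:
        raise ValueError("at least one true target is required")

    points = []
    for threshold in sorted(set(scores), reverse=True):
        detections = sum(
            1 for s, t in zip(scores, is_target) if t and s >= threshold
        )
        false_alarms = sum(
            1 for s, t in zip(scores, is_target) if not t and s >= threshold
        )
        points.append((false_alarms, detections / num_targets))
    return points
```

For example, `roc_points([0.9, 0.8, 0.7, 0.4], [True, False, True, False])` yields the points (0, 0.5), (1, 0.5), (1, 1.0), and (2, 1.0); a curve that reaches high detection probability at a low false alarm count indicates a well-performing algorithm.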

Because  this  effort  is  effectively  self-contained  in  ob¬ 
taining  its  own  data  collections  under  controlled  condi¬ 
tions,  datasets  can  be  obtained  holding  some  variables 
constant  (e.g.,  pixels  on  target  at  a  given  range  and 
at  a  given  percentage  of  occlusion)  while  varying  other 
parameters (e.g., different viewing aspects). Currently
planned are approximately 20-40 such polarization images
to be taken for a given target under prescribed conditions,
from which an ROC curve for particular algorithms can
be computed. Now consider taking the dataset over again
for a different level of occlusion and computing ROC curves
for the same set of algorithms. How does the relative performance
of each algorithm compare for different levels
of occlusion? Does this change from vehicle to vehicle?
The range of variables across which polarization image
datasets  will  be  collected  includes:  target  type,  range- 
to-target,  pixels  on  target  (or  pixels  on  camouflage  net), 
level  of  occlusion,  orientation  of  target,  spectrum  the  po¬ 
larization  image  is  in  (i.e.,  visible,  near-infrared,  mid- 
infrared),  sky  illumination  condition  relative  to  viewing, 
weather  condition,  and  whether  it  is  day  or  night.  Such  a 
procedure  identifies  conditions  under  which  certain  algo¬ 
rithms  perform  well,  and  establishes  their  breaking  point 
conditions. 

Progress of this effort will be measured primarily in
terms of improved empirical performance of ATR/D algorithms
under different prescribed conditions. For instance,
research resulting in algorithms that increase the
probability of recognition/detection while not increasing
false alarm rates for a given set
of target conditions is one demonstration of a significant
level of progress. Another level of progress can be measured
by an increased level of interest by the military in this
technology for ATR/D applications as a result of demonstrated
performance.

5  Conclusion 

Overviewed in this paper is an effort for advancing
the state-of-the-art in polarization-based image understanding
for Automatic Target Recognition and Detection.
It was stressed that the effort will be highly empirical,
guided by fundamental physical understanding of the
processes that affect the polarization of light. Progress
of this effort will be measured by demonstrated performance.

Abstract

In  this  project,  we  will  develop  and  demonstrate 
new  algorithms  for  the  illumination  and  temperature 
invariant  recognition  of  targets  in  multispectral  in¬ 
frared  images.  The  algorithms  will  be  based  on  the 
use  of  invariants  computed  from  image  regions.  These 
invariants  are  derived  from  physical  models  for  image 
formation  and  are  independent  of  viewpoint  and  the 
illumination,  atmospheric,  and  thermal  environments. 
We  have  shown  that  invariants  can  be  computed  that 
capture arbitrary combinations of spectral and spatial
information, allowing spectral/spatial tradeoffs to
be  optimized  according  to  the  characteristics  of  a  par¬ 
ticular  recognition  problem.  Since  the  algorithms  are 
derived  from  physical  models,  constraints  on  the  phys¬ 
ical  environment  can  be  incorporated  to  improve  per¬ 
formance.  Extensive  experiments  will  be  conducted  to 
demonstrate  the  effectiveness  of  the  approach. 

1  Introduction 

Automatic  target  recognition  systems  which  utilize 
only  spatial  information  are  severely  limited  in  their 
ability  to  detect  targets  with  low  contrast  or  in  envi¬ 
ronments  containing  camouflage  or  deception.  The  de¬ 
velopment  of  multispectral  and  hyperspectral  infrared 
imagers  promises  to  provide  sensor  input  that  can  be 
exploited  to  improve  the  performance  of  target  recog¬ 
nition  systems  in  these  circumstances.  The  use  of 
multispectral  infrared  imagery  for  target  recognition 
has  several  important  advantages.  Infrared  sensors 
provide superior performance to other passive electro-optical
imaging devices when night or bad weather
operation  is  required.  Infrared  images  provide  infor¬ 
mation  about  both  the  reflective  and  emissive  prop¬ 
erties  of  objects  in  a  scene.  Exploiting  this  informa¬ 
tion  across  several  bands  can  significantly  improve  the 
reliability  of  recognition  especially  in  the  presence  of 
partial  occlusion  or  camouflage. 

For  many  years,  multispectral  image  measurements 
have  been  used  as  a  feature  for  recognition.  Spectral 
measurements  in  satellite  images,  for  example,  are  of¬ 
ten  used  to  classify  areas  of  the  earth  according  to 
known  spectral  signatures  for  classes  such  as  water 
or  vegetation.  The  matching  of  color  distributions  in 
visible  images  has  been  demonstrated  for  the  recogni¬ 
tion  of  many  objects  [14].  These  methods  work  well 
in  certain  contexts,  but  depend  on  the  assumption 
that  spectral  measurements  for  a  given  object  do  not 
change.  Unfortunately,  changes  in  illumination,  atmo¬ 
spheric  conditions,  and  the  thermal  environment  can 
cause  significant  variation  in  observed  spectral  mea¬ 
surements  for  a  fixed  object.  Consequently,  the  perfor¬ 
mance  of  methods  which  use  direct  spectral  compari¬ 
son  for  recognition  degrades  significantly  in  changing 
environments. 

This  project  addresses  the  problem  of  automatic 
target  recognition  in  multispectral  infrared  images. 
New  recognition  algorithms  will  be  derived  from  a 
physical  model  for  infrared  image  formation.  The  al¬ 
gorithms  make  use  of  invariants  which  capture  spec¬ 
tral  and  spatial  properties  of  objects  but  which  are  in¬ 
variant  to  changes  in  illumination,  atmospheric  condi¬ 
tions,  and  temperature.  A  recognition  system  will  be 
implemented based on invariant computation and indexing.
The system will be evaluated by extensive experimentation
on multispectral images in several different
environments.


2  Objectives 

The  algorithms  developed  in  this  project  will  sig¬ 
nificantly  improve  battlefield  awareness  in  several  ar¬ 
eas.  The  algorithms  exploit  physical  invariants  allow¬ 
ing  targets  to  be  identified  under  an  unprecedented 
range  of  conditions.  Since  the  algorithms  can  capture 
arbitrary  spectral  and  spatial  characteristics  of  targets 
and  backgrounds,  they  will  improve  on  existing  capa¬ 
bilities  to  discriminate  camouflage  from  background 
and  targets  from  decoys.  The  algorithms  allow  for  tar¬ 
gets  to  be  characterized  independent  of  context  and  for 
images  to  be  illumination  corrected  for  automatic  veri¬ 
fication  of  an  identified  target.  After  characterization, 
the  algorithms  can  be  used  to  track  the  target  under 
a  wide  range  of  conditions.  By  using  multispectral 
data  to  identify  the  material  composition  of  a  target, 
the  algorithms  can  be  used  to  match  munitions  to  tar¬ 
get  composition.  The  algorithms  can  also  be  used  for 
quantifying  destruction  in  support  of  battle  damage 
assessment. 

The  algorithms  developed  in  this  work  can  also  be 
used  to  support  the  efforts  of  image  analysts.  The 
models  can  be  used  to  characterize,  identify,  and  track 
objects  in  terms  of  spectral  and  textural  descriptors. 
Environmental  parameters  and  physical  descriptions 
of  objects  can  be  specified  by  the  image  analyst  to 
constrain  searches  for  objects  of  interest  in  multispec¬ 
tral  images.  Camouflage  and  decoys  can  be  charac¬ 
terized  and  detected.  Illumination  effects  can  be  com¬ 
pensated  for  to  generate  images  which  are  easier  to 
interpret.  The  algorithms  can  be  used  to  identify  ma¬ 
terials  in  a  scene  for  tasks  such  as  determining  terrain 
trafficability.  Methods  developed  in  this  project  can 
be  used  to  register  new  imagery  with  previous  imagery 
obtained  under  different  illumination  and  atmospheric 
conditions. 


3  Research  Issues 
3.1  Background 

Color  constancy  is  the  ability  to  recover  descriptors 
of  an  object  from  a  color  image  that  do  not  depend  on 
the  illumination.  Human  vision  exhibits  approximate 
color constancy over many different surfaces and illumination
conditions. Color constancy is desirable for
any recognition system that is required to function in
environments where illumination cannot be controlled.
For example, Figure 1 displays images of the same textured
surface under yellow (left) and red (right) illuminants.
The sensed images are significantly different
even though the underlying surface is the same. Computational
approaches to color constancy have focused
on the visible wavelengths, where illumination variation
is the most significant factor in causing variation
in multispectral measurements for a fixed object.

Figure 1: Surface under 2 illuminants

Without  assumptions  about  the  scene,  color  con¬ 
stancy  is  not  possible  [4].  Several  algorithms  for  color 
constancy are based on the premise that spectral reflectance
can be approximated by a linear combination
of a fixed set of basis functions [3]. This premise
has  been  carefully  verified  using  large  sets  of  visible 
spectral  reflectance  data  [1]  [7].  Recent  work  which 
uses  linear  reflectance  models  has  combined  spectral 
and  spatial  properties  for  illumination-invariant  recog¬ 
nition  using  multispectral  distributions  [5]  [11],  local 
spatial  structure  [12],  multispectral  texture  [6],  and 
combined  properties  [2].  These  techniques  have  been 
demonstrated  for  recognition  on  many  visible  multi¬ 
spectral  images  in  the  presence  of  large  illumination 
changes.  Some  work  has  begun  on  finding  invariants 
in single band long-wave infrared images [8]. At this
stage,  however,  the  physical  basis  for  these  invariants 
is  still  under  investigation. 

In this project, we will generalize computational
color constancy methods to the infrared by developing
algorithms for computing invariants from multispectral
infrared images. Achieving this goal presents
several important challenges. Infrared image measurements
combine information about both reflected radiation
and thermal emissions, complicating infrared image
modeling relative to the visible case where reflection
is the dominant process. Infrared image measurements
for specific objects will change with the ambient
illumination, atmospheric conditions, and temperature.
For example, Figure 2 plots the measured spectral
apparent temperature for a surface coated with
tan chemical agent resistant coating (CARC-tan) under
two different conditions [13]. The structure of the
two functions is significantly different. The use of finite
dimensional models for spectral reflectance has
not been analyzed as carefully in the infrared as for
visible wavelengths. Properties of these models must
be understood to enable the computation of infrared
invariants which are analogous to physical invariants
which have been utilized in visible multispectral images.
Multispectral infrared image sensors often provide
many more spectral bands than visible sensors, introducing
computational issues of band selection and
grouping. It is also desirable to develop a general approach
for combining spectral and spatial information
in a recognition system.

Figure 2: CARC-tan under two different conditions

Our  strategy  for  recognition  is  based  on  the  compu¬ 
tation  of  invariants  from  multispectral  image  regions. 
Using  a  physical  model  for  infrared  image  formation 
and  a  linear  model  for  spectral  reflectance,  we  have 
shown  that  illumination  and  temperature  changes  cor¬ 
respond  to  affine  coordinate  transformations  of  multi¬ 
spectral  distributions.  From  this  relationship,  we  have 
derived  a  set  of  affine  invariants  that  can  be  used  for 
target  recognition  under  a  wide  range  of  conditions. 
These  invariants,  however,  do  not  exploit  the  spatial 
information  in  a  multispectral  image.  It  is  possible, 
especially  in  environments  containing  camouflage  and 
decoys,  for  regions  with  similar  multispectral  distributions
to  have  subtle  differences  in  spatial  structure.  In
order  to  capture  spatial  information,  we  have  shown 
that  distributions  in  spatially  filtered  multispectral 
images  deform  according  to  a  similar  coordinate  trans¬ 
formation  in  response  to  scene  changes.  Thus,  affine 
invariants  of  filtered  distributions  are  invariant  to  il¬ 
lumination,  atmospheric  conditions,  and  temperature. 
These  filtered  distributions  capture  information  about 
spatial  structure  in  an  image  according  to  the  spatial 
properties  of  the  filter  used.  Since  these  invariants 
can  be  tuned  to  capture  desired  spectral  and  spatial 
attributes,  they  can  be  optimized  for  many  different 
target  recognition  problems. 
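
As an illustration of what affine invariance of a multispectral distribution means, the sketch below computes one classical affine-invariant statistic of a pixel distribution (the mean fourth power of the Mahalanobis distance to the region mean) before and after an invertible affine map of the band coordinates. This is not one of the invariants derived in this project; the function name, the synthetic four-band region, and the random affine map standing in for a scene change are all assumptions made for the example.

```python
import numpy as np

def mahalanobis_kurtosis(pixels):
    """Mean fourth power of the Mahalanobis distance to the region mean.

    pixels: (n_pixels, n_bands) array of multispectral measurements.
    The value is unchanged by any invertible affine map of the bands.
    """
    centered = pixels - pixels.mean(axis=0)
    cov = centered.T @ centered / len(pixels)
    # Squared Mahalanobis distance of every pixel to the region mean.
    d2 = np.einsum("ij,ij->i", centered @ np.linalg.inv(cov), centered)
    return float(np.mean(d2 ** 2))

rng = np.random.default_rng(0)
# Hypothetical 4-band region with a non-Gaussian distribution.
region = rng.gamma(shape=2.0, scale=1.0, size=(5000, 4))

# Stand-in for a scene change: invertible linear map plus offset.
A = rng.normal(size=(4, 4)) + 4.0 * np.eye(4)
b = rng.normal(size=4)
region_changed = region @ A.T + b

k_before = mahalanobis_kurtosis(region)
k_after = mahalanobis_kurtosis(region_changed)
```

The two values agree up to floating-point error, whereas the raw band means and covariances of the two regions differ substantially.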

3.2  Modeling  IR  Reflectance  Functions 

The  infrared  invariants  described  in  3.1  are  based 
on  the  use  of  a  linear  model  for  spectral  reflectance. 
An  important  aspect  of  this  project  is  to  determine 
regions  of  the  infrared  over  which  the  finite  dimen¬ 
sional  spectral  reflectance  model  is  accurate.  In  this 
context,  accurate  means  that  the  number  of  param¬ 
eters  required  by  the  model  matches  the  anticipated 
number  of  spectral  bands.  We  have  collected  a  large 
amount  of  high  quality  infrared  spectral  reflectance 
data  to  establish  regions  of  the  infrared  over  which  the 
linear  spectral  reflectance  model  is  accurate.  Data  for 
natural  materials  from  0.4  to  14  μm  has  been  obtained
from  the  Remote  Sensing  Laboratory  at  Johns  Hopkins
University.  This  data  includes  measurements  for
various  minerals,  vegetation,  rocks,  and  soils.  Details
of  the  measurement  techniques  are  given  in  [9]  and
[10].  A  database  of  spectral  reflectance  data  for  man-made
materials  has  been  obtained  from  the  National
Imagery  Resource  Library.  These  materials  include 
concrete,  road  asphalts  and  tar,  construction  materi¬ 
als,  paints,  and  roofing  materials.  The  evaluation  of 
this  spectral  reflectance  data  will  follow  a  principal 
components  analysis  over  subsets  of  the  infrared  as 
was  performed  by  Maloney  [7]  for  visible  reflectance 
spectra.  The  study  will  provide  guidance  in  the  se¬ 
lection  and  grouping  of  infrared  spectral  bands  for  in¬ 
variant  computation.  This  analysis  is  also  likely  to  be 
useful  for  other  researchers  working  on  algorithms  for 
processing  multispectral  infrared  image  data. 
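
The kind of analysis described above can be sketched as follows: given reflectance spectra sampled over one wavelength interval, count how many basis functions are needed to capture almost all of their energy. The function name, the 99.9% energy threshold, and the synthetic three-parameter spectra are assumptions for illustration; the measured databases are not reproduced here.

```python
import numpy as np

def model_dimension(spectra, energy=0.999):
    """Smallest number of basis functions capturing `energy` of the spectra.

    spectra: (n_materials, n_wavelengths) reflectance samples over one
    wavelength interval. The SVD is taken without mean removal, matching a
    linear model s(lambda) = sum_i a_i * b_i(lambda).
    """
    singular_values = np.linalg.svd(spectra, compute_uv=False)
    cumulative = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    return int(np.searchsorted(cumulative, energy) + 1)

# Synthetic stand-in for measured spectra: an exact 3-parameter linear model.
t = np.linspace(0.0, 1.0, 50)          # normalized wavelength axis
basis = np.stack([np.ones_like(t), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
rng = np.random.default_rng(1)
spectra = rng.normal(size=(200, 3)) @ basis

dimension = model_dimension(spectra)
```

Running the same function on different wavelength sub-intervals would indicate where a model with few parameters holds, which is the information needed for band selection.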

3.3  Spectral/Spatial  Tradeoffs 

The  multispectral  distribution  and  spatial  invari¬ 
ants  provide  a  general  set  of  tools  for  illumination 
and  temperature  invariant  recognition  in  multispec¬ 
tral  images.  The  invariants  are  derived  from  a  gen¬ 
eral  physical  model  for  image  formation  and  can  be 
computed  for  any  combination  of  spectral  bands  and
spatial  properties.  Given  this  level  of  generality,  an 
important  aspect  of  algorithm  development  and  evalu¬ 
ation  will  be  to  determine  the  most  effective  invariants 
for  recognition.  This  will  require  the  development  of 
tools  for  optimizing  the  set  of  invariants  chosen  de¬ 
pending  on  generic  or  specific  characteristics  of  tar¬ 
gets,  backgrounds,  and  the  environment. 

Existing  multispectral  and  hyperspectral  imagers 
provide  from  tens  to  hundreds  of  spectral  measure¬ 
ments  at  each  pixel.  To  optimize  system  efficiency, 
some  subset  of  these  bands  should  be  selected  for  in¬ 
variant  computation.  In  addition,  as  long  as  the  linear 
spectral  reflectance  model  is  satisfied,  invariants  can 
be  computed  for  arbitrary  subsets  of  bands.  Thus, 
it  is  useful  from  a  computational  viewpoint  to  parti¬ 
tion  the  selected  bands  into  relatively  small  subsets  for 
invariant  computation.  In  particular,  spectral  bands 
which  exhibit  low  correlation  should  be  placed  in  dis¬ 
tinct  subsets.  In  this  project,  we  will  develop  methods 
for  band  selection  and  grouping  for  target  recognition 
problems. 
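
One simple way to realize the grouping rule above is to link bands whose correlation is high and to treat each connected set of linked bands as one subset, so that weakly correlated bands end up in distinct subsets. The sketch below is only an assumed baseline, not the method to be developed in this project; the threshold of 0.8 and the synthetic six-band data are hypothetical.

```python
import numpy as np

def group_bands(pixels, threshold=0.8):
    """Partition band indices so weakly correlated bands land in different groups.

    pixels: (n_pixels, n_bands). Bands are linked when |correlation| exceeds
    `threshold`; each connected set of linked bands becomes one group.
    """
    corr = np.abs(np.corrcoef(pixels, rowvar=False))
    unassigned = set(range(corr.shape[0]))
    groups = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            linked = [b for b in sorted(unassigned) if corr[current, b] > threshold]
            for b in linked:
                unassigned.remove(b)
            group.extend(linked)
            frontier.extend(linked)
        groups.append(sorted(group))
    return groups

rng = np.random.default_rng(2)
# Hypothetical 6-band data driven by two independent sources.
source_a = rng.normal(size=(2000, 1))
source_b = rng.normal(size=(2000, 1))
noise = 0.1 * rng.normal(size=(2000, 6))
pixels = np.hstack([source_a] * 3 + [source_b] * 3) + noise

groups = group_bands(pixels)
```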

An important issue in the design of a recognition system is the
selection of the set of spatial filters used to generate the
filtered images from which invariants are computed. General
considerations suggest the use of a set of filters which provides
a compact representation that is descriptive enough to enable
accurate recognition. Given the specifications for a recognition
problem, optimized filter sets can be designed that maximize
target/background discriminability while using as few invariants
as possible. Another design consideration is the window size used
for invariant computation. Larger windows provide better
statistical distribution estimates, but are more susceptible to
occlusion in the scene.

The  spectral  and  spatial  invariants  must  be  simul¬ 
taneously  optimized  and  combined  for  recognition. 
Tradeoffs  amongst  spectral  bands,  spatial  filters,  and 
window  size  are  interrelated.  Increasing  the  number 
of  spectral  bands,  for  example,  can  often  be  used  to 
reduce  the  window  size  required  to  achieve  the  same 
level  of  performance.  We  will  develop  methods  for  op¬ 
timizing  the  spectral/spatial  tradeoffs  given  character¬ 
istics  of  the  recognition  task.  The  goal  will  be  to  de¬ 
termine  the  most  effective  invariant  features  for  recog¬ 
nition.  Since  different  invariants  differ  in  discrimina¬ 
tory  power  and  estimated  accuracy,  we  will  derive  op¬ 
timal  methods  of  combining  and  comparing  vectors  of 
invariants.  The  methods  for  invariant  selection  and
combination  will  be  evaluated  using  recognition  results
obtained  on  multispectral  image  data.


4  Evaluation 

The  proposed  approach  to  target  recognition  is 
based  on  the  use  of  invariants  derived  from  physi¬ 
cal  models  for  multispectral  infrared  image  formation. 
Since  these  invariants  are  computed  directly  from  im¬ 
age  regions,  a  recognition  system  will  be  implemented 
based  on  invariant  computation  and  indexing.  The 
models  underlying  the  algorithms  will  be  evaluated 
on  large  sets  of  spectral  reflectance  and  spectral  radi¬ 
ance  data.  The  system  itself  will  be  evaluated  over  a 
wide  range  of  targets  and  backgrounds  under  different 
conditions  by  experimentation  with  a  large  volume  of 
multispectral  image  data.  System  performance  will 
be  compared  to  traditional  multispectral  recognition 
systems  that  are  based  on  comparison  of  spectral  sig¬ 
natures.  This  experimentation  will  allow  us  to  quan¬ 
tify  system  performance  for  different  classes  of  scenes 
and  also  to  analyze  systematically  a  number  of  spec¬ 
tral/spatial  tradeoffs  that  impact  performance. 

4.1  Model  Evaluation 

A  large  collection  of  hyperspectral  infrared  radi¬ 
ance  measurements  has  been  assembled  as  part  of  the 
Joint  Multispectral  Sensor  Program  [13].  This  data  in¬ 
cludes  field  measurements  of  various  targets  and  natu¬ 
ral  backgrounds  under  different  conditions  during  dif¬ 
ferent  seasons.  This  data  will  be  used  in  conjunction 
with  the  spectral  reflectance  data  described  in  3.2  to 
determine  the  accuracy  of  the  scene  change  model  over 
regions  of  the  infrared  for  various  targets  and  back¬ 
grounds. 

4.2  System  Evaluation 

We  have  obtained  a  large  volume  of  multispectral 
image  data  for  algorithm  evaluation.  The  imagery  has 
been  collected  as  part  of  the  Hyperspectral  Digital  Im¬ 
agery  Collection  Experiment  (HYDICE)  and  the  Air¬ 
borne  Remote  Earth  Sensing  Program  (ARES).  This 
data  includes  multispectral  and  hyperspectral  images 
of  targets  and  camouflaged  targets  in  various  environ¬ 
ments  including  coastal,  desert,  forest,  urban,  and  mil¬ 
itary. 

Since  the  focus  of  this  project  is  invariant  recogni¬ 
tion,  the  metric  used  for  evaluation  will  be  a  compar¬ 
ison  of  the  performance  of  the  new  algorithms  against 
traditional  multispectral  approaches  that  use  direct 
comparison  of  spectral  signatures.  The  spatial  and
spectral  information  input  to  each  system  will  be  the
same  to  enable  a  meaningful  comparison.  The  results 
will  be  target  recognition  rates  and  false  alarm  rates 
for  both  systems  as  a  function  of  spectral,  spatial,  and 
scene  parameters.  This  experimentation  will  lead  to 
better  understanding  of  important  issues  as  well  as 
refinements  in  the  models  and  algorithms.  The  exper¬ 
iments  will  allow  us  to  characterize  the  set  of  scenes  to 
which  our  algorithms  may  be  applied  reliably  and  to 
quantify  the  uncertainty  in  results  for  different  classes 
of  scenes.  We  expect  to  achieve  a  significant  perfor¬ 
mance  improvement  over  traditional  approaches. 
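
The two figures of merit named above can be computed from per-region decisions as in the following sketch. The function name and the ten hypothetical test regions are assumptions for illustration; no results from the evaluation are implied.

```python
import numpy as np

def recognition_and_false_alarm_rates(is_target, declared_target):
    """Return (recognition rate, false alarm rate) for boolean arrays.

    is_target: ground truth per test region.
    declared_target: system decision per test region.
    """
    is_target = np.asarray(is_target, dtype=bool)
    declared_target = np.asarray(declared_target, dtype=bool)
    recognition = np.sum(declared_target & is_target) / np.sum(is_target)
    false_alarm = np.sum(declared_target & ~is_target) / np.sum(~is_target)
    return float(recognition), float(false_alarm)

# Hypothetical outcome: 4 target regions, 6 background regions.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
decisions = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
rates = recognition_and_false_alarm_rates(truth, decisions)
```

Evaluating both systems on the same regions and tabulating these two rates against spectral, spatial, and scene parameters gives the comparison described in the text.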

5  Summary 

This  research  will  advance  understanding  of  the 
fundamental  capabilities  and  limitations  of  multispectral
infrared  recognition  systems  which  operate  in
changing  environments.  New  algorithms  will  be  de¬ 
veloped  for  target  recognition  in  the  presence  of  un¬ 
known  illumination,  atmospheric  conditions,  and  tem¬ 
perature.  These  algorithms  are  derived  from  a  de¬ 
tailed  physical  model  for  infrared  image  formation  and 
make  use  of  invariants  which  can  be  computed  for  any 
combination  of  spectral  bands  and  spatial  properties. 
This  flexibility  enables  these  algorithms  to  be  applied 
to  a  wide  range  of  target  recognition  problems.  Phys¬ 
ical  knowledge  about  targets,  backgrounds,  and  the 
environment  can  be  used  in  conjunction  with  the  al¬ 
gorithms  to  improve  performance  and  efficiency.  This 
research  will  reveal  the  effectiveness  and  limitations  of 
camouflage  under  different  conditions.  Insights  gained 
during  this  work  can  be  used  to  influence  future  data 
collection  and  sensor  design.  Efforts  will  be  made  to 
transfer  the  algorithms  developed  in  this  project  to 
military  and  commercial  systems. 

Abstract

A  new  algorithm  is  proposed  that  performs  com¬ 
petitive  principal  component  analysis  (PCA)  of  an 
image.  A  set  of  expert  PCA  networks  compete, 
through  the  Mixture  of  Experts  (MOE)  formalism, 
on  the  basis  of  their  ability  to  reconstruct  the  origi¬ 
nal  image.  The  result  is  that  the  network  finds  an 
optimal  projection  of  the  image  onto  a  reduced 
dimensional  space  as  a  function  of  the  input  and, 
hence,  of  image  content.  As  a  by-product,  the 
image  is  segmented  and  identified  according  to  sta¬ 
tionary  regions. 

1.  Introduction 

The  optimal  linear  block  transform  for  coding 
images  is  the  Karhunen-Loeve  transform  (KLT), 
also  known  as  principal  component  analysis 
(PCA).  However,  PCA  assumes  that  the  statistics 
of  the  image  are  stationary  over  the  various  sub¬ 
blocks,  an  assumption  that  is  often  violated  in  prac¬ 
tice.  One  way  around  this  is  to  use  multiple  PCA 
experts,  and  allow  each  expert  to  specialize  on  a 
different  stationary  region  of  the  image. 

Dony  and  Haykin  [1995]  developed  just  such  a 
technique  that  uses  multiple  PCA  experts  that  com¬ 
pete  in  an  iterative,  winner-take-all  format.  For 
each  sub-block,  the  winning  expert  is  the  one 
whose  sub-space  projection  has  the  greatest  vari¬ 
ance.  In  this  sense,  the  individual  PCA  experts  act 
as  a  set  of  matched  eigenfilters.  The  winning 
expert  is  then  granted  the  right  to  update  its 
weights,  using  one  of  the  on-line  learning  rules  for 
which  the  weights  are  known  to  converge  to  the 
true  eigenfilters. 
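
The winner-take-all selection just described can be sketched as below, reading "greatest variance" for a single zero-mean sub-block as the largest energy of its sub-space projection. The function name and the two toy experts are assumptions for illustration and are not taken from Dony and Haykin's implementation.

```python
import numpy as np

def winning_expert(block, experts):
    """Index of the expert whose subspace projection of `block` has most energy.

    block: (P,) zero-mean sub-block vector.
    experts: list of (P, M) matrices with orthonormal columns.
    """
    energies = [float(np.sum((W.T @ block) ** 2)) for W in experts]
    return int(np.argmax(energies))

P = 64
identity = np.eye(P)
# Two hypothetical experts spanning disjoint coordinate subspaces.
experts = [identity[:, :4], identity[:, 4:8]]
block = np.zeros(P)
block[5] = 1.0  # energy lies entirely in the second expert's subspace

winner = winning_expert(block, experts)
```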


However,  hard  competition  is  not  without  its  diffi¬ 
culties.  It  is  extremely  sensitive  to  initialization, 
such  that  some  experts  may  never  be  active.  Dony 
and  Haykin  initialized  all  the  experts  to  the  global 
eigenfilters,  plus  a  small  random  value,  and  found 
that  this  kept  all  experts  active.  They  also  placed 
the  eigenfilters  on  a  Kohonen  [1982]  neighborhood 
map,  and  annealed  them  down  to  hard  competition. 

However,  hard  competition  may  not  be  optimal. 
There  may  be  sub-blocks  of  an  image  that  can  be 
reconstructed  better  using  a  combination  of 
experts,  in  soft  competition.  In  this  paper,  we  pro¬ 
pose  to  use  the  Mixture  of  Experts  (MOE)  as  the 
soft  competitive  mechanism  between  expert  PCA 
networks.  Jordan  and  Jacobs  [1994]  first  enticingly 
mentioned  the  possibility  of  using  PCA  experts  in 
the  MOE  architecture  but,  to  the  best  of  our  knowl¬ 
edge,  no  such  application  has  appeared  to  date. 

2.  Principal  Component  Analysis 

When $x$ is a $P$ dimensional zero mean random vector, it can be
represented without error by the sum of $P$ linearly independent
vectors as

$$x = \sum_{i=1}^{P} y_i w_i = W y \tag{1}$$

The matrix $W$ is deterministic and full rank. The columns of $W$
span the $P$ dimensional space and are called the basis vectors.

$$W = [\, w_1 \;\; w_2 \;\; \cdots \;\; w_P \,] \tag{2}$$

Since the vectors $w_i$ satisfy an orthonormality constraint, the
coefficients of the expansion can be found from:

$$y_i = w_i^T x \;\Leftrightarrow\; y = W^T x \tag{3}$$

Thus, $y$ can be regarded as the output of a system that linearly
rotates the input. Maximizing the variance of $y_i$ over the data
set, with respect to the weights, results in an eigenvalue equation
for the autocorrelation matrix of the input.

$$R w_i = \lambda_i w_i, \qquad R = E[x x^T] \tag{4}$$

Starting from scratch, if we perform the expansion over a subset of
the basis functions,

$$\hat{x} = \sum_{i=1}^{M} y_i w_i = W y, \qquad M < P \tag{5}$$

then $W$ is no longer full rank, and we can only approximate the
input. However, it can be shown that minimizing the L2 norm of the
reconstruction error over the data set,

$$C = E[\, \| x - \hat{x} \|^2 \,] \tag{6}$$

under the same orthonormality constraints as in (2), again results
in the eigenvalue equation (4) but where the $M$ eigenvectors are
chosen on the basis of the $M$ largest corresponding eigenvalues.
Thus, minimizing the L2 norm of the reconstruction error is the
same as maximizing the output variance. The minimum cost in (6) is
given by the sum of the discarded eigenvalues:

$$C_{\min} = \sum_{i=M+1}^{P} \lambda_i \tag{7}$$

An alternative solution for PCA can be found by means of a linear
system using adaptive algorithms. There are several such
algorithms; however, we will concentrate on Sanger's [1989] rule,
for which the weight update after the presentation of the $n$th
pattern is

$$\Delta W^T(n) = \eta \{\, y(n) x^T(n) - \mathrm{LT}[\, y(n) y^T(n) \,]\, W^T(n) \,\} \tag{8}$$

where $\eta$ is a learning rate parameter and LT denotes lower
triangular. Equation (8) is an on-line algorithm to compute the
optimal sub-space projection.

3.  The  Mixture  of  Experts

It is well known that, given a data set {X, D}, where X is the
input and D is the desired signal, and a transfer function
Y = f(X, W), where W is a set of free parameters, the maximum
likelihood solution for W under a Gaussian assumption on the error
is equivalent to minimizing the mean square error. However, many
data sets do not exhibit single Gaussian modes. One example is data
sets that include multiple valued targets.

The Mixture of Experts directly attacks this problem with a set of
experts moderated by a gate, and modeling their errors as a mixture
of Gaussian probability density functions (pdfs):


$$p(d|x) = \sum_{k=1}^{K} P(k|x)\, p(d|x,k) = \sum_{k=1}^{K} G_k(x)\, p(d|x,k), \tag{9}$$

$$p(d|x,k) = \frac{1}{(2\pi)^{\alpha/2} |\Sigma_k|^{1/2}} \exp\left[ -\frac{1}{2} (y_k - d)^T \Sigma_k^{-1} (y_k - d) \right] \tag{10}$$

where $y_k$ is the output of the $k$th expert, $d$ is the desired
signal, and $\alpha$ is the dimension of $y$. In order to reduce the
number of free parameters, it is common to make the assumption that
the cross correlation of the error of the $k$th expert, $\Sigma_k$,
can be approximated by

$$\Sigma_k = \sigma_k^2 I \tag{11}$$

The mixing coefficients, $G_k$, are a function of the input only,
and can be regarded as prior probabilities. Multiplying (9) by the
desired response, and integrating, we see that:

$$y = E[d|x] = \int d \; p(d|x)\, \mathrm{d}d = \sum_{k=1}^{K} G_k(x) \int d \; p(d|x,k)\, \mathrm{d}d = \sum_{k=1}^{K} G_k(x)\, y_k(x) \tag{12}$$

Thus the output of the network is a weighted sum of the experts'
outputs, with the weighting coefficients provided by the gate.

The experts and gate can be chosen to be either linear or
non-linear networks, although in the original formulation they were
restricted to generalized linear models (GLIMs). Weigend [1995]
used non-linear experts and gate, and called the model non-linear
gated experts. The advantage of using non-linear networks is that a
single hierarchical layer can theoretically solve any problem. In
this paper we will use linear experts and a non-linear gate.
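
The statement that the network output is a gate-weighted sum of the experts' outputs can be sketched in a few lines. The single-layer softmax gate used here is a simplifying assumption (the gate in this paper is a non-linear network), and the function names, dimensions, and random weights are hypothetical.

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over the last axis."""
    e = np.exp(s - np.max(s, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def moe_output(x, expert_weights, gate_weights):
    """Gate-weighted sum of linear expert outputs for one input x.

    expert_weights: (K, D_out, D_in), one linear map per expert.
    gate_weights:   (K, D_in), a single-layer softmax gate (an assumption).
    Returns (network output of shape (D_out,), gate values of shape (K,)).
    """
    expert_outputs = expert_weights @ x          # (K, D_out)
    gate = softmax(gate_weights @ x)             # (K,), sums to one
    return gate @ expert_outputs, gate

rng = np.random.default_rng(3)
x = rng.normal(size=5)
expert_weights = rng.normal(size=(3, 2, 5))
gate_weights = rng.normal(size=(3, 5))
y, gate = moe_output(x, expert_weights, gate_weights)
```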

Instead of maximizing the likelihood, it is customary to minimize
the negative log-likelihood, resulting in the cost function over
the entire data set:

$$C = -\ln L = -\sum_{n=1}^{N} \ln \sum_{k=1}^{K} G_k(n)\, p(d(n)|x(n),k) \tag{13}$$

3.1  Gradient  Descent  Learning

The free parameters of the system are the weights of the gating
network, the weights of the expert networks, and the cross
correlation.

In order to ensure that the outputs of the gating network sum to
one, the output is a layer of softmax axons,

$$G_k = \frac{\exp(s_k)}{\sum_{j=1}^{K} \exp(s_j)} \tag{14}$$

where $s_k$ is the input to the $k$th softmax axon, in which case
the error backpropagated to the input side of the softmax layer is:

$$\frac{\partial C}{\partial s_k} = \sum_{n=1}^{N} [\, G_k(n) - H_k(n) \,] \tag{15}$$

$$H_k = \frac{G_k\, p(d|x,k)}{\sum_{j=1}^{K} G_j\, p(d|x,j)} \tag{16}$$

The $H_k$ are posterior probabilities, in the sense that they
depend on both the input and the outputs of the experts.

There is an analytical solution for the variances which, given the
assumption in (11), is given by

$$\frac{\partial C}{\partial \sigma_k} = 0 \;\Rightarrow\; \sigma_k^2 = \frac{1}{\alpha} \, \frac{\sum_{n=1}^{N} H_k(n)\, \| d(n) - y_k(n) \|^2}{\sum_{n=1}^{N} H_k(n)} \tag{17}$$

The weight change of some parameter, $w_k$, in the $k$th expert
network is proportional to the gradient:

$$\Delta w_k = -\eta \frac{\partial C}{\partial w_k} = \eta \sum_{n=1}^{N} \frac{H_k(n)}{\sigma_k^2} \left[ (d(n) - y_k(n))^T \frac{\partial y_k(n)}{\partial w_k} \right] \tag{18}$$

In gradient descent, it is common to approximate the total gradient
over all patterns by the instantaneous gradient. We now seek to
expand (18) for the case when the experts are PCA networks.

4.  The  Mixture  of  Experts  and  Principal
Component  Analysis

We now propose for the first time to use linear PCA networks as the
experts in the MOE formalism. In this paper, we restrict ourselves
to gradient descent learning. We can incorporate Sanger's rule into
the MOE by noting that the term in brackets in (18) is the weight
update for an isolated network trained under gradient descent.
Therefore, the weight update for the $k$th PCA expert is

$$\Delta W_k^T(n) = \eta \frac{H_k(n)}{\sigma_k^2} \{\, y_k(n) x^T(n) - \mathrm{LT}[\, y_k(n) y_k^T(n) \,]\, W_k^T(n) \,\} \tag{19}$$

where, as in (8), $y_k(n) = W_k^T(n) x(n)$ denotes the projection
computed by the $k$th expert.

After training, segmentation can be achieved by applying a
winner-take-all threshold to the gating network's output:

$$\mathrm{Class}(n) = \arg\max_k [\, G_k(n) \,], \qquad 1 \le k \le K \tag{20}$$
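
As an illustration of how Sanger's rule (8) can be scaled by an expert's posterior probability, as described for the PCA experts in Section 4, the sketch below applies such a step to a single expert. The function name, the synthetic data, the learning rate, and the fixed values `h=1.0` and `sigma2=1.0` (which reduce the step to plain Sanger's rule) are assumptions made for the example and are not settings from the experiments.

```python
import numpy as np

def gated_sanger_step(W, x, h, sigma2, eta):
    """One posterior-weighted Sanger update for a single PCA expert.

    W: (P, M) current basis (columns), x: (P,) input pattern,
    h: posterior probability of this expert for x, sigma2: its error variance.
    Returns the updated (P, M) matrix.
    """
    y = W.T @ x                                   # (M,) projection
    # Sanger's rule on W^T: y x^T - LT[y y^T] W^T, LT = lower triangular part.
    delta_Wt = np.outer(y, x) - np.tril(np.outer(y, y)) @ W.T
    return W + eta * (h / sigma2) * delta_Wt.T

rng = np.random.default_rng(4)
# Hypothetical data with variances 9, 4, 1, 0.25 along the coordinate axes.
data = rng.normal(size=(4000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
W = 0.1 * rng.normal(size=(4, 2))
for epoch in range(5):
    for x in data:
        W = gated_sanger_step(W, x, h=1.0, sigma2=1.0, eta=0.002)
```

With a posterior of one the columns of `W` should align, up to sign, with the two highest-variance axes; in a mixture, a small posterior simply shrinks the step an expert takes on patterns it does not explain well.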


4.1  Training 

We  applied  the  model  to  a  512x512  grayscale 
image  of  Lena.  The  training  set  consisted  of  4096 
non-overlapping  sub-blocks  of  8x8  pixels.  The 
mean  was  removed  from  each  sub-block  prior  to 
training. 
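
The preprocessing described above can be sketched as follows. A random array stands in for the 512x512 training image, and the function name is hypothetical; only the block size and the mean removal come from the text.

```python
import numpy as np

def image_to_blocks(image, block=8):
    """Split a 2-D image into non-overlapping block x block patches.

    Returns an (n_blocks, block*block) array with each patch's mean removed.
    """
    rows, cols = image.shape
    patches = (image.reshape(rows // block, block, cols // block, block)
                    .swapaxes(1, 2)
                    .reshape(-1, block * block)
                    .astype(float))
    return patches - patches.mean(axis=1, keepdims=True)

# Random stand-in for the 512x512 grayscale training image.
image = np.random.default_rng(5).integers(0, 256, size=(512, 512))
blocks = image_to_blocks(image)
```

For a 512x512 image this yields 64 x 64 = 4096 vectors of dimension 64, matching the training set size quoted above.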

There  were  four  linear  PCA  experts,  each  consist¬ 
ing  of  a  64  to  4  dimensionality  reduction  (and  a  4 
to  64  expansion  for  reconstruction).  The  gating 
network  was  a  two  hidden  layer  multi-layer  perceptron,
with  four  hyperbolic  tangent  activation
functions  on  each  hidden  layer.  The  overall  gate 
architecture  was  thus  64-4-4-4. 

During  training,  the  sub-blocks  were  randomly 
presented  for  approximately  100  epochs,  until  both 
the  likelihood  reached  a  plateau,  and  the  segmenta¬ 
tion  stabilized. 

The results are shown in Figure 1. The original image is shown in
a), the reconstructed image in b), and the segmentation in c). The
segmentation is grayscale coded to indicate the winning expert,
using (20), for each sub-block. White is used for the most frequent
winning expert, while black is used for the least frequent winning
expert.

Figure  1.  Lena  training  results.


Figure  2  below  shows  the  masks  of  the  four 
experts.  The  experts  are  ordered  from  most  active 
(expert  1)  to  least  active  (expert  4),  and  the  eigen¬ 
vectors  are  ordered  from  the  largest  (top)  to  small¬ 
est  (bottom)  corresponding  eigenvalues. 


Figure  2.  PCA  masks  of  the  four  experts.


From  Figures  1c  and  2,  we  see  that  each  expert
clearly  specialized  on  a  particular  feature  of  the 
image,  since  the  experts’  components  are  signifi¬ 
cantly  different  from  each  other.  Expert  1  special¬ 
ized  on  texture,  while  experts  2  and  3  specialized 
on  edges. 

It is clear that the MOE network is not only doing data reduction
but also segmentation. This is an important observation since
conventional PCA treats the entire image as a single class, i.e.,
it is limited to the representation of the image. With our scheme
each PCA component can represent a subset of the image patterns.
Competition is utilized to find each piece, and thus the method
brings discrimination into PCA, which has not been done in the
past.


4.2  Generalization 

We  tested  the  trained  model  on  a  480x480  gray¬ 
scale  image  of  a  mandrill.  The  results  are  shown 
in  Figure  3.  We  see  that
the  reconstruction  is  reasonable,  but  that  the  seg¬ 
mentation  is  not  nearly  as  smooth.  This  is  most 
likely  due  to  the  predominance  of  high  frequency 
regions,  which  demonstrates  that,  to  be  universal, 
the  method  needs  to  be  trained  on  more  images. 

4.3  Comparison  with  the  global  KLT 

We  also  compared  the  mean  squared  error  (mse) 
performance  of  the  mixture  of  PCA  experts  to  a 
global  KLT.  We  did  two  tests.  In  the  first,  we 
made  the  KLT  reduction  64:4,  identical  to  each 
of  the  experts.  The  KLT  reconstructed  image  for 
this  case  is  shown  in  Figure  4.  In  the  second,
we  kept  the  number  of  free  weights  the
same,  resulting  in  a  KLT  reduction  of  64:16.  If 
the  method  was  strictly  hard  competition,  the 
first  case  would  be  appropriate.  However  since 
the  gate  is  a  softmax  function,  all  of  the  experts 
can  contribute  to  the  reconstruction  of  each  sub¬ 
block,  although  in  practice,  most  sub-blocks 
were  dominated  by  one  or  two  experts. 
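
The global KLT baseline used in this comparison, and its link to the sum of discarded eigenvalues in (7), can be sketched as below. The synthetic 64-dimensional vectors, the function name, and the error normalization (squared error summed over the 64 pixels and averaged over sub-blocks) are assumptions; the normalization behind the values in Table 1 is not stated in the text.

```python
import numpy as np

def global_klt_mse(blocks, n_keep):
    """Reconstruction error of a global KLT keeping `n_keep` components.

    blocks: (n_blocks, P) zero-mean sub-block vectors.
    Returns (mean squared reconstruction error per block,
             sum of the discarded eigenvalues of the autocorrelation matrix).
    """
    R = blocks.T @ blocks / len(blocks)            # autocorrelation estimate
    eigenvalues, eigenvectors = np.linalg.eigh(R)  # ascending order
    W = eigenvectors[:, -n_keep:]                  # top n_keep eigenvectors
    reconstruction = blocks @ W @ W.T
    mse = np.mean(np.sum((blocks - reconstruction) ** 2, axis=1))
    return float(mse), float(np.sum(eigenvalues[:-n_keep]))

rng = np.random.default_rng(6)
# Hypothetical correlated 64-dimensional "sub-blocks".
blocks = 0.1 * (rng.normal(size=(4096, 64)) @ rng.normal(size=(64, 64)))
mse_4, discarded_4 = global_klt_mse(blocks, 4)
mse_16, discarded_16 = global_klt_mse(blocks, 16)
```

The error of the 64:16 reduction is necessarily no larger than that of the 64:4 reduction, which is why the second test is the more demanding baseline for the mixture.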

The  results  for  both  the  Lena  and  mandrill
images  are  shown  in  Table  1.

Figure  3.  Mandrill  testing  results:  a)  original  image,
b)  reconstructed  image,  c)  segmentation.

Figure  4.  Global  KLT  reconstruction  (64:4).

From  the  table  we  see  that,  for  the  Lena  image,  the 
MOE  did  outperform  the  global  KLT  with  the 
same  number  of  basis  vectors.  However,  it  did  worse  when
compared  to  a  global  KLT  with  the  same  overall  number  of
free  PCA  weights.  This  is  what  we  would  expect,  but  we
believe  that  the  MOE  performance  can  be 
improved  (these  are  preliminary  results).  For  the 
case  of  the  mandrill,  the  MOE  performed  reason¬ 
ably  well  considering  that  the  KLT  performance 
was  determined  using  the  mandrill’s  own  global 
eigenvectors. 


Table 1: MSE Comparison (values ×10^n; the exponent n is illegible in the source).

Model / Image      Lena     Mandrill
MOE (64:4)x4       1.05     11.5
KLT 64:4           1.08     10.8
KLT 64:16          0.26     5.1

5.  Conclusions 

This paper extends for the first time the MOE formalism to PCA.
The importance of this extension is that competition is brought
into PCA, which has been lacking. This may remove the major
obstacle to applying PCA for signal classification. We also show
that the MOE performs a little better than PCA with the same
number of basis vectors, but further improvements are possible.

Abstract

We  present  a  new  information  theoretic  approach 
for  training  a  system  in  a  self-organized  or  super¬ 
vised  manner.  Information  theoretic  signal  process¬ 
ing  looks  beyond  the  second  order  statistical 
characterization  inherent  in  the  linear  systems 
approach.  The  information  theoretic  framework 
probes  the  probability  space  of  the  data  under  anal¬ 
ysis.  This  technique  has  wide  application  and  rep¬ 
resents  a  powerful  new  advance  to  the  area  of 
information  theoretic  signal  processing.

1.0  Introduction 

We  have  recently  developed  a  new  information  the¬ 
oretic  technique  for  feature  extraction  [Fisher  and 
Principe,  1995].  This  method  differs  from  previous 
methods  in  that  it  is  not  limited  to  linear  topologies 
[Linsker,  1988]  nor  uni-modal  probability  density 
functions  (PDFs)  [Bell  and  Sejnowski,  1995].  The 
method  is  directly  applicable  to  any  nonlinear  map¬ 
ping  which  is  differentiable  in  its  parameters.  In 
particular,  we  demonstrate  that  the  technique  can 
be  applied  to  a  feed-forward  multi-layer  perceptron 
(MLP)  with  an  arbitrary  number  of  hidden  layers. 

Our  goal  is  a  theory  and  methodology  to  extract 
optimal  features  from  observations  of  the  data. 

2.0  Information  Theoretic  Perspective

We  approach  the  feature  extraction  problem  from 
an  information  theoretic  perspective.  Within  this 
framework,  the  feature  extraction  mapping  is 
viewed  as  a  special  form  of  a  communications 
channel,  $g(\alpha, \cdot)$,  which  includes  nonlinear  subspace
projections.  The  design  goal  is  to  maximize
the  information  transmitted  through  the  channel. 

In  this  approach  there  are  two  basic  assumptions. 
First,  we  assume  that  we  have  a  finite  set  of  obser¬ 
vations  of  the  data  under  consideration.  Second, 
since  the  method  addresses  pattern  recognition  we 
assume  that  the  data  lies  in  a  subspace. 

The figure of merit we use is mutual information, which can be
written

$$I(x, y) = h(y) - h(y|x), \tag{1}$$

where $h(\cdot)$ is the differential entropy of a continuous random
variable [Papoulis, 1991],

$$h(x) = -\int f_x(x) \log(f_x(x))\, dx, \tag{2}$$

and $h(y|x)$ is the conditional entropy, which substitutes
conditional probabilities.
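
As a numerical check on the definition of differential entropy in (2), the sketch below integrates $-f \log f$ on a grid for a Gaussian density and compares the result with the known closed form $\frac{1}{2}\ln(2\pi e \sigma^2)$. The function name, the grid, and the choice $\sigma = 2$ are assumptions for the example.

```python
import numpy as np

def differential_entropy(pdf, grid):
    """Approximate -integral f(x) log f(x) dx on a uniform grid (natural log)."""
    f = pdf(grid)
    # Treat 0 * log 0 as 0 so that points outside the support contribute nothing.
    integrand = np.where(f > 0, -f * np.log(np.where(f > 0, f, 1.0)), 0.0)
    return float(np.sum(integrand) * (grid[1] - grid[0]))

sigma = 2.0

def gaussian(x):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return np.exp(-x ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

grid = np.linspace(-40.0, 40.0, 20001)
h_numeric = differential_entropy(gaussian, grid)
h_closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
```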

The  appeal  of  mutual  information  as  a  criterion  for 
feature  extraction  is  threefold.  First,  mutual  infor¬ 
mation  exploits  the  structure  of  the  underlying 
probability  density  function.  Second,  adaptation,  as  we
will  show,  can  be  used  to  remove  as  much  uncertainty
as  possible  about  the  input,  $x$,  using  observations  of  the
output,  $y$.  Third,  this  is  accomplished  within  the
constraints  of  the  mapping  topology. 

One  obstacle  to  using  mutual  information  as  the 
figure  of  merit  is  that  it  is  an  integral  function  of 
the  PDF  of  a  continuous  random  variable.  Since  we
cannot  work  with  the  PDF  directly  (unless  assumptions
are  made  about  its  form),  we  rely  on  nonparametric
estimates.  Nonparametric  density  estimation  in  a
high-dimensional  space  is  an  ill-posed  problem.  The
approach  described  here,  however,
relies  on  such  estimates  in  the  output  space,
as  depicted  in  figure  1,  where  the  dimensionality  is 
under  the  control  of  the  designer. 


Figure  1.  Mapping  as  feature  extraction. 
Information  content  is  measured  in  the  low 
dimensional  space  of  the  observed  output. 


Viola  et  al.  [1996]  have  taken  a  very  similar
approach  to  entropy  manipulation,  although  that 
work  differs  in  that  it  does  not  address  arbitrary 
nonlinear  mappings  directly,  the  gradient  is  esti¬ 
mated  stochastically,  and  entropy  is  manipulated 
explicitly. 

3.0  Derivation  of  the  Learning  Algorithm 

As  we  stated,  our  goal  is  to  find  features  that  con¬ 
vey  maximum  information  about  the  input.  How  do 
we  adapt  the  parameters,  $\alpha$,  of  a  mapping  such  that
this  is  the  case?

Consider  the  mapping  $g: \Re^N \rightarrow \Re^M$;  $M < N$,  of  a
random  vector  $X \in \Re^N$,  which  is  described  by  the
following  equation

$$Y = g(\alpha, X) \tag{3}$$

If  g{a,x)  is  linear,  very  little  can  be  done  to 
manipulate  the  information  about  X  contained  in 
Y  without  prior  knowledge  of  the  pdf  of  X  [Deco 
and  Obradovic,  1996].  A  nonlinear  mapping,  how¬ 
ever,  can  exploit  the  following  property  of  entropy. 
If  a  random  variable  has  a  finite  region  of  support, 
its  entropy  is  maximized  when  the  random  variable 
is  uniformly  distributed  throughout  the  region  of 
support. One candidate topology which has a restricted
output range, among other desirable properties, is
the MLP using sigmoidal nonlinearities. Furthermore,
if the range of the mapping is a hypercube
(as in the MLP), the elements of a uniformly distributed
output vector are statistically independent.

By  changing  the  parameters  of  the  mapping  such 
that  the  observed  output  distribution  is  uniform,  we 
have  effectively  maximized  entropy  (i.e.  informa¬ 
tion).  In  fact,  by  defining  an  appropriate  distance 
metric,  we  can  use  the  method  described  here  to 
maximize  or  minimize  entropy  depending  on  the 
desired  goal.  As  a  consequence  of  our  choices  the 
approach  fits  naturally  into  a  back-propagation 
learning  framework. 
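
The maximum-entropy property invoked above can be checked numerically in a discrete analog. The following is a minimal sketch, assuming a four-bin discretization of a bounded output range; the bin probabilities are illustrative and do not come from the experiments.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# Two hypothetical distributions over the same four bins of a bounded range.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.70, 0.10, 0.10, 0.10]

h_uniform = shannon_entropy(uniform)
h_peaked = shannon_entropy(peaked)
print(h_uniform, h_peaked)
```

With the support fixed, the uniform assignment attains the largest entropy, log(4) nats here, which is the property the adaptation exploits.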

As our criterion we use the integrated squared error
between our estimate of the output distribution,
$\hat{f}_Y(u, y)$, at a point $u$ over a set of observations $y$,
and the desired output distribution, $f_Y(u)$, which
we approximate with a summation.

$$J = \frac{1}{2}\int_{\Omega_Y} \left(f_Y(u) - \hat{f}_Y(u, y)\right)^2 du \approx \frac{1}{2}\sum_j \left(f_Y(u_j) - \hat{f}_Y(u_j, y)\right)^2 \Delta u \qquad (4)$$

In equation 4, $\Omega_Y$ indicates the nonzero region (a
hypercube for the uniform distribution) over which
the $M$-fold integration is evaluated. Assuming the
output distribution is sampled adequately, we can
approximate this integral with a summation in
which $u_j \in \Re^M$ are samples in $M$-space and $\Delta u$
represents a volume.

We use the Parzen window method [Parzen, 1962]
as our estimator of the output distribution. The
Parzen window estimate of a PDF is written

$$\hat{f}_Y(u, y) = \frac{1}{N_y}\sum_{i=1}^{N_y} \kappa(y_i - u) \qquad (5)$$

where $\kappa(\cdot)$ is the kernel function,
$y = \{y_1, \ldots, y_{N_y}\}$ is the set of observations at the
output of the mapping, and $u$ is the location at
which the output estimate is being computed. Since
the output observations are functional mappings of
the input data, we can rewrite 5 as

$$\hat{f}_Y(u, y) = \frac{1}{N_y}\sum_{i=1}^{N_y} \kappa(g(a, x_i) - u) = \hat{f}_Y(u, g(a, x)) \qquad (6)$$
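
The Parzen estimate and the summation form of the criterion can be sketched in a few lines. This is an illustrative one-dimensional sketch only: the gaussian kernel width `sigma`, the regular grid, and the two sample sets are assumptions made for the example, not values from the paper.

```python
import math

def gaussian_kernel(v, sigma):
    """One-dimensional gaussian kernel of width sigma."""
    return math.exp(-0.5 * (v / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def parzen_estimate(u, y, sigma):
    """Parzen window estimate of the output PDF at location u."""
    return sum(gaussian_kernel(yi - u, sigma) for yi in y) / len(y)

def integrated_squared_error(y, grid, target_pdf, sigma):
    """Summation approximation of the integrated squared error criterion."""
    du = grid[1] - grid[0]  # volume element of a regular 1-D grid
    return 0.5 * du * sum(
        (target_pdf(u) - parzen_estimate(u, y, sigma)) ** 2 for u in grid
    )

# Assumed setup: outputs confined to [-1, 1], as with a tanh output node,
# and a uniform desired density of 1/2 on that interval.
grid = [-1.0 + 0.05 * j for j in range(41)]
uniform_pdf = lambda u: 0.5

spread = [-0.9 + 0.2 * i for i in range(10)]  # evenly spread outputs
clumped = [0.02 * i for i in range(10)]       # outputs bunched near zero

j_spread = integrated_squared_error(spread, grid, uniform_pdf, sigma=0.2)
j_clumped = integrated_squared_error(clumped, grid, uniform_pdf, sigma=0.2)
print(j_spread, j_clumped)
```

Evenly spread outputs give a smaller criterion value than bunched outputs, which is the behavior the adaptation is meant to reward when entropy is maximized.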




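The gradient of the criterion function with respect
to the mapping parameters is determined via the
chain rule as

$$\frac{\partial J}{\partial a} = \left(\frac{\partial J}{\partial \hat{f}}\right)\left(\frac{\partial \hat{f}}{\partial g}\right)\left(\frac{\partial g}{\partial a}\right) = -\Delta u \sum_j \epsilon_Y(u_j, y) \sum_i \frac{\partial \hat{f}_Y(u_j, y)}{\partial g_i}\,\frac{\partial g_i}{\partial a} \qquad (7)$$

where $\epsilon_Y(u_j, y) = f_Y(u_j) - \hat{f}_Y(u_j, y)$ is the observed distribution error
over all observations $y$, and $g_i = g(a, x_i)$. The last term in 7, $\partial g/\partial a$,
is recognized as the sensitivity of our mapping to
the parameters $a$. Since our mapping is a feed-forward
MLP ($a$ represents the weights and bias
terms of the neural network), this term can be computed
efficiently using standard backpropagation.
The remaining partial derivative, $\partial \hat{f}/\partial g$, is

$$\frac{\partial \hat{f}_Y(u_j, y)}{\partial g_i} = \frac{1}{N_y}\,\kappa'(g(a, x_i) - u_j) = \frac{1}{N_y}\,\kappa'(y_i - u_j) \qquad (8)$$

where $\kappa'(\cdot)$ is the derivative of the kernel function
with respect to its argument.

Substituting 8 into 7 yields

$$\frac{\partial J}{\partial a} = -\frac{\Delta u}{N_y}\sum_j \sum_i \epsilon_Y(u_j, y)\,\kappa'(y_i - u_j)\,\frac{\partial g_i}{\partial a} = -\frac{1}{N_y}\sum_i \left(\Delta u \sum_j \epsilon_Y(u_j, y)\,\kappa'(y_i - u_j)\right)\frac{\partial g_i}{\partial a} \qquad (9)$$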


The  terms  in  9,  excluding  the  mapping  sensitivi¬ 
ties,  become  the  new  error  direction  term  in  our 
backpropagation  algorithm.  By  reversing  the  order 
of  summations  in  9  we  see  that  the  error  direction 
term associated with each observation is a convolution
of the estimated error in the observed output
distribution, $\epsilon_Y(u, y)$, with the gradient of the kernel,
$\kappa'(\cdot)$. It is through the gradient of the estimator
kernel that the distribution error influences the
direction  of  each  data  observation  and  thereby 
(through  backpropagation)  the  parameters  of  the 
mapping.  This  point  will  be  further  illustrated  in 
the  next  section  for  the  case  of  gaussian  kernels. 
This  adaptation  scheme  is  depicted  in  figure  2.  As 
can  be  seen,  this  approach  fits  readily  into  the 
backpropagation framework. The point set
$x = \{x_1, \ldots, x_{N_y}\}$ is mapped to a point set
$y = \{y_1, \ldots, y_{N_y}\}$. The criterion then estimates
from  the  set  an  error  between  the  observed  output 
distribution  and  the  baseline  output  distribution 
(uniform  in  this  case).  From  this  distribution  error 
computed  over  the  range  of  the  output  space,  an 
error  direction  (whose  sign  depends  on  whether  we 
wish to minimize or maximize entropy) is associated
with each data point in the set $y$. This error
direction  is  then  backpropagated  through  the  MLP 
in  order  to  modify  the  parameters  of  the  mapping. 
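
How the per-observation error direction combines with the mapping sensitivity can be sketched for a deliberately tiny mapping. Everything specific here is an assumption for illustration: a single-weight mapping y = tanh(w x) stands in for the MLP, the kernel is a one-dimensional gaussian of width `SIGMA`, the distribution error is taken as desired minus estimated density, and the inputs are made up.

```python
import math

SIGMA = 0.2  # assumed kernel width

def kernel(v):
    return math.exp(-0.5 * (v / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))

def kernel_grad(v):
    """Derivative of the gaussian kernel with respect to its argument."""
    return -v / SIGMA ** 2 * kernel(v)

def error_directions(y, grid, target_pdf):
    """du * sum_j eps(u_j) * kernel'(y_i - u_j) for every observation y_i."""
    du = grid[1] - grid[0]
    n = len(y)
    # Distribution error on the grid: desired density minus Parzen estimate.
    eps = [target_pdf(u) - sum(kernel(yi - u) for yi in y) / n for u in grid]
    return [du * sum(e * kernel_grad(yi - u) for e, u in zip(eps, grid)) for yi in y]

def criterion_gradient(w, x, grid, target_pdf):
    """Gradient of the criterion with respect to w for y = tanh(w * x)."""
    y = [math.tanh(w * xi) for xi in x]
    directions = error_directions(y, grid, target_pdf)
    # Mapping sensitivity dy_i/dw = (1 - y_i^2) * x_i, as backpropagation gives.
    return -sum(d * (1.0 - yi * yi) * xi for d, yi, xi in zip(directions, y, x)) / len(x)

x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]        # hypothetical inputs
grid = [-1.0 + 0.05 * j for j in range(41)]  # sample points u_j on [-1, 1]
grad = criterion_gradient(0.1, x, grid, lambda u: 0.5)
print(grad)
```

With a small weight the outputs bunch near zero, so the gradient is negative and a descent step increases the weight, spreading the outputs toward the uniform target.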

4.0  Gaussian  Kernels 


Examination of the gaussian kernel and its differential
in two dimensions illustrates some of the
practical issues of implementing this method of
feature extraction and provides an intuitive
understanding of what is happening during the
adaptation process. The N-dimensional gaussian
kernel evaluated at some $u$ is (simplified for two
dimensions)


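$$\kappa(u) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}u^T\Sigma^{-1}u\right) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{u^T u}{2\sigma^2}\right), \quad \Sigma = \sigma^2 I,\ N = 2 \qquad (10)$$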


The partial derivative of the kernel (also simplified
for the two-dimensional case) with respect to the
input $y_i$ as observed at the output of the MLP is

$$\frac{\partial \kappa}{\partial u} = -\kappa(u)\,\Sigma^{-1}u = -\frac{u}{\sigma^2}\,\kappa(u). \qquad (11)$$


These  functions  are  shown  in  figure  3. 
Recall that the term

$$\Delta u \sum_j \epsilon_Y(u_j, y)\,\kappa'(y_i - u_j)$$

in  9  replaces  the  standard  supervised  error  direc¬ 
tion  term  in  the  backpropagation  algorithm.  From 
the figure we see that when we are maximizing
entropy, the kernel functions, weighted by the
distribution error, act as local attractors where the computed PDF
error is positive and as local repellors where the
PDF error is negative. When we are minimizing
entropy  the  behavior  is  opposite.  In  this  way  the 
adaptation  procedure  operates  in  the  feature  space 
locally  from  a  globally  derived  measure  of  the  out¬ 
put  space  (PDF  estimate). 
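
The attractor and repellor behavior can be illustrated by letting the error directions move a set of output points directly. This is a simplified sketch, not the MLP adaptation itself: the outputs are updated in place rather than through network weights, and the kernel width, grid, step size, and starting points are all assumed values.

```python
import math

SIGMA = 0.2  # assumed kernel width

def kernel(v):
    return math.exp(-0.5 * (v / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))

def kernel_grad(v):
    return -v / SIGMA ** 2 * kernel(v)

def distribution_error(y, grid, target):
    """Desired density minus Parzen estimate at every grid point."""
    n = len(y)
    return [target - sum(kernel(yi - u) for yi in y) / n for u in grid]

def criterion(y, grid, target, du):
    return 0.5 * du * sum(e * e for e in distribution_error(y, grid, target))

def adapt(y, grid, target, du, rate, steps):
    """Move each output a small step along its error direction."""
    y = list(y)
    for _ in range(steps):
        eps = distribution_error(y, grid, target)
        y = [yi + rate * du * sum(e * kernel_grad(yi - u) for e, u in zip(eps, grid))
             for yi in y]
    return y

du = 0.05
grid = [-1.0 + du * j for j in range(41)]
start = [0.02 * i - 0.09 for i in range(10)]  # outputs bunched near zero
finish = adapt(start, grid, target=0.5, du=du, rate=0.01, steps=200)

j_start = criterion(start, grid, 0.5, du)
j_finish = criterion(finish, grid, 0.5, du)
spread_start = max(start) - min(start)
spread_finish = max(finish) - min(finish)
print(j_start, j_finish, spread_start, spread_finish)
```

Where the estimated density exceeds the uniform target the points are pushed apart, and they are pulled toward under-populated regions, so the criterion falls as the points spread out.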
In the first experiment we wish to illustrate the differences
between the well known principal components
analysis (PCA) approach to feature
extraction  as  compared  to  an  entropy  driven 
approach.  We  will  begin  with  the  simple  case  of  a 
two  dimensional  gaussian  distribution.  The  distri¬ 
bution  we  will  use  is  zero  mean  with  a  covariance 
matrix  of 


Figure 3. Gradient of two-dimensional
gaussian  kernel.  The  kernels  act  as  attractors 
to  low  points  in  the  observed  PDF  on  the 
data  when  entropy  maximization  is  desired. 

5.0  Experimental  Results 

We  present  experimental  results  designed  to  illus¬ 
trate  the  advantages  of  the  information  theoretic 
approach. 


The  contours  of  this  distribution  are  shown  in  fig¬ 
ure  4  along  with  the  image  of  the  first  principal 
component  features.  We  see  from  the  figure  that 
the first principal component lies along the $x_0$-axis.
We draw a set of observations (50 in this case)
from  this  distribution  and  compute  a  mapping 
using  an  MLP  and  the  entropy  maximizing  crite¬ 
rion  described  in  previous  sections.  The  architec¬ 
ture  of  the  MLP  is  2-4-1,  indicating  2  input  nodes, 
4 hidden nodes, and 1 output node. The nonlinearity
used is the hyperbolic tangent function. We are,
therefore, nonlinearly mapping the two-dimensional
input space onto a one-dimensional output
space. The plot at the bottom of figure 4 shows the
image  of  the  maximum  entropy  mapping  onto  the 
input  space.  From  the  contours  of  this  mapping  we 
see  that  the  maximum  entropy  mapping  lies  essen¬ 
tially  in  the  same  direction  as  the  first  principal 
components. 


[Figure 4, top panel: PCA mapping; horizontal axis $x_0$.]


Figure  4.  PCA  vs.  Entropy.  Top:  image  of 
PCA  features  shown  as  contours.  Bottom: 
Entropy  mapping  shown  as  contours. 

This  result  is  expected.  It  illustrates  that  when  the 
gaussian  assumption  is  supported  by  the  data,  max¬ 
imum  entropy  and  PCA  are  equivalent.  This  result 
has  been  recognized  by  many  researchers. 

We are also concerned with the case when the
gaussian assumption is not correct, in which case
we would not expect PCA and the entropy direction to
be equivalent. We conduct a second experiment to
illustrate this point, where we draw observations
from a random source whose underlying distribution
is not gaussian. Specifically, the PDF is a mixture
of gaussian modes with the following form

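$$p(x) = \frac{1}{2}\left(N(x, m_1, \Sigma_1) + N(x, m_2, \Sigma_2)\right)$$

where $N(x, m, \Sigma)$ is a gaussian distribution with
mean $m$ and covariance $\Sigma$. In this specific case

$$m_1 = \begin{bmatrix} -1 \\ 1 \end{bmatrix},\quad \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 0.1 \end{bmatrix},\quad m_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix},\quad \Sigma_2 = \begin{bmatrix} 0.1 & 0 \\ 0 & 1 \end{bmatrix}$$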

It can be shown that the principal components of
this distribution are the eigenvectors of the matrix

$$R = \frac{1}{2}\left(\Sigma_1 + m_1 m_1^T + \Sigma_2 + m_2 m_2^T\right)$$
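
The eigenvector computation can be sketched directly from this matrix. The means and covariances below are assumptions: they follow one reading of the scanned parameter values, which are hard to make out in the source, and should be treated as illustrative rather than authoritative.

```python
import math

def outer(m):
    """Outer product m m^T of a 2-vector."""
    return [[m[0] * m[0], m[0] * m[1]], [m[1] * m[0], m[1] * m[1]]]

def add(*mats):
    """Element-wise sum of 2x2 matrices."""
    return [[sum(a[r][c] for a in mats) for c in range(2)] for r in range(2)]

# Assumed mixture parameters (illustrative reading of the source).
m1, m2 = (-1.0, 1.0), (1.0, -1.0)
s1 = [[1.0, 0.0], [0.0, 0.1]]
s2 = [[0.1, 0.0], [0.0, 1.0]]

# R = (1/2) * (S1 + m1 m1^T + S2 + m2 m2^T)
R = [[0.5 * v for v in row] for row in add(s1, outer(m1), s2, outer(m2))]

# Closed-form largest eigenpair of a symmetric 2x2 matrix [[a, b], [b, c]].
a, b, c = R[0][0], R[0][1], R[1][1]
lam_max = 0.5 * (a + c) + math.hypot(0.5 * (a - c), b)
vx, vy = b, lam_max - a  # valid eigenvector whenever b != 0
norm = math.hypot(vx, vy)
first_pc = (vx / norm, vy / norm)
print(lam_max, first_pc)
```

Under these assumed values the first principal component points along the line joining the two means, while the individual modes are elongated along the coordinate axes, which is structure that a single linear direction cannot capture.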

This  distribution  is  shown  at  the  top  of  figure  5 
along  with  its  first  principal  component  feature 
mapping.  The  bottom  of  figure  5  shows  the  image 
of  the  maximum  entropy  mapping.  As  we  can  see 
there  are  two  distinct  differences  between  this 
mapping  and  the  PCA  result.  The  first  observation 
is  that  the  mapping  is  nonlinear.  The  second  obser¬ 
vation  is  that  the  maximum  entropy  mapping  is 
more  tuned  to  the  structure  of  the  data  in  the  input 
space. 

This  experiment  helps  to  illustrate  the  differences 
between  PCA  and  information  (entropy).  PCA  is 
primarily  concerned  with  direction  finding  and 
only  looks  at  the  second  order  statistics  of  the 
underlying  data,  while  entropy  explores  the  struc¬ 
ture  of  the  data  class.  In  a  few  limited  cases,  second 
order  statistics  are  sufficient  (e.g.  gaussian)  to 
describe  such  structure,  but  in  general  they  are  not. 

5.2  A  Simple  Classification  Problem 

The  next  experiment  illustrates  the  difference 
between  the  linear  systems  and  information  theo¬ 
retic  approaches  to  classification.  We  begin  with  a 
two class problem in which the underlying distribution
for class one is gaussian with mean $m_1$ and
standard deviation $\sigma_1$ while class two is gaussian
with mean zero and standard deviation $\sigma_1/2$. It is
well known that the optimal discriminant function
chooses class one if $x < -1.29$ or $0.62 < x$. The two
distributions are shown in the top plot of figure 6.
In  the  bottom  plot  of  the  same  figure  we  replace 
class  one  with  an  exponential  distribution  of  the 
same  mean  and  standard  deviation  as  class  one. 




Figure 5. PCA vs. Entropy. Top: image of
PCA  features  shown  as  contours.  Bottom: 
Entropy  mapping  shown  as  contours. 

The  optimal  discriminant  function  in  this  case 
chooses  class  one  if  x  >  0 . 
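
The quoted decision regions can be checked with a short calculation. The sketch below assumes equal class priors and measures x in units of the class-one standard deviation with the class-one mean equal to that standard deviation; these are assumptions, chosen because they reproduce the thresholds quoted in the text.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def exponential_pdf(x, mean):
    """Exponential density; its mean and standard deviation are both `mean`."""
    return math.exp(-x / mean) / mean if x >= 0.0 else 0.0

def gaussian_thresholds(m1, s1, m2, s2):
    """Points where two equal-prior gaussian densities cross (requires s1 != s2)."""
    # Equating the log densities gives a*x^2 + b*x + c = 0.
    a = 1.0 / (2.0 * s2 ** 2) - 1.0 / (2.0 * s1 ** 2)
    b = m1 / s1 ** 2 - m2 / s2 ** 2
    c = m2 ** 2 / (2.0 * s2 ** 2) - m1 ** 2 / (2.0 * s1 ** 2) - math.log(s1 / s2)
    root = math.sqrt(b * b - 4.0 * a * c)
    return sorted(((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)))

# Case 1: class one N(1, 1), class two N(0, 0.5), in units of sigma_1.
low, high = gaussian_thresholds(1.0, 1.0, 0.0, 0.5)
print(round(low, 2), round(high, 2))

# Case 2: class one exponential with mean 1; check it dominates for x >= 0.
exp_wins = all(
    exponential_pdf(0.01 * k, 1.0) > gaussian_pdf(0.01 * k, 0.0, 0.5)
    for k in range(501)
)
print(exp_wins)
```

Between the two crossing points the narrower class-two density is larger, so class one is chosen outside that interval; in the exponential case the class-one density is larger for every non-negative x on the grid and is zero for negative x.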

Both  of  these  discriminant  functions  can  be  repre¬ 
sented  with  MLPs.  The  question  is  how  to  adapt 
the  parameters  of  the  mapping.  The  usual  approach 
is  to  assign  a  functional  value  to  either  class  and 
use  the  mean-square  error  (MSE)  criterion  to  adapt 
the weights of the MLP. When this is done the
resulting  discriminant  functions  are  shown  for  both 
cases  overlaid  onto  the  distributions  in  figure  6. 
From  the  figure  we  can  see  that  in  the  first  case 
(both classes are gaussian), the resulting discriminant
function is close to the optimal; however, in


[Figure 6 panels: MSE criterion (top and bottom); horizontal axis $x/\sigma_1$.]

Figure  6.  Two  class  problem.  Case  1  (top): 
both  distributions  are  gaussian.  Case  2 
(bottom):  one  distribution  is  gaussian,  the 
other  is  exponential.  The  discriminant 
thresholds  are  shown  as  a  dashed-dotted 
line. 

the  second  case,  the  discriminant  is  significantly 
biased from the optimal solution. We attribute this
result to the fact that the MSE criterion by itself is
not  sufficient  to  examine  the  underlying  structure 
of  the  input  data. 

Alternatively  we  first  compute  a  maximum  entropy 
mapping  over  the  mixture  class.  After  remapping 
the  input  data  we  use  the  same  MLP  as  before  to 
compute a new discriminant function. In the gaussian
case the result is not substantially different;
however, as shown at the top of figure 7, the result
for the gaussian/exponential case is much better.
The discriminant function is now much
closer  to  the  optimal  solution  than  when  the  MSE 
criterion  alone  was  used.  We  attribute  this  result  to 
the  fact  that  the  maximum  entropy  pre-processor 
Figure 7. Entropy/MSE criterion result.
Gaussian  versus  exponential.  Entropy  pre¬ 
processing  of  the  mixture  class  removes  the 
bias  in  the  discriminant  threshold  (dashed- 
dotted  line). 

maps  the  input  data  such  that  the  subsequent  dis¬ 
criminant  is  easier  to  achieve  in  the  MLP  architec¬ 
ture. 

6.0  Conclusions 

We  have  described  a  nonparametric  approach  to 
information  theoretic  feature  extraction.  We 
believe  that  this  method  can  be  used  to  improve 
classification performance by directly choosing relevant
features for classification. A critical capability
of the information theoretic approach is the
ability to adapt the entropy of the output space of
the nonlinear projection conditioned on the
input data. We have shown that, through the use of a
simple differentiable estimator, namely Parzen
windows, the adaptation of entropy can fit logically
into the error backpropagation model. This
method  is  different  from  other  entropy  based 
approaches  such  as  using  the  Kullback-Leibler 
norm  for  supervised  learning. 

We  have  also  presented  experiments  that  illustrate 
the  usefulness  of  this  technique.  Comparisons  to 
the  well  known  PCA  method  show  that  the  infor¬ 


mation  theoretic  approach  is  more  sensitive  to  the 
underlying  data  structure  beyond  simple  second- 
order  statistics. 

We  have  also  shown  that  the  approach  improved 
the  resulting  discriminant  function  for  a  non-gauss- 
ian  classification  problem.  The  data  types  used  for 
the  experiments  were  simple  by  design.  They 
served  to  illustrate  the  usefulness  of  the  method 
even  for  seemingly  simple  problems.  We  believe 
that  for  more  complicated  data  structures  such  as 
SAR  imagery  the  improvement  will  be  more  sig¬ 
nificant.  We  will  be  reporting  such  results  in  the 
future. 
ABSTRACT

A  physical  optics  formalism  is  used  to  establish  a  first- 
principles  analysis  for  discriminating  specular  returns 
from  diffuse  returns  in  a  1-D  synthetic  aperture  radar. 
The  optimum  Neyman-Pearson  detection  processor  is 
shown  to  substantially  outperform  the  conventional,  full- 
resolution  SAR  imager  for  extended  specular  targets. 


1. INTRODUCTION

Synthetic aperture radars (SARs) provide the coverage
rate and all-weather operability needed for wide-area
surveillance. SAR-based automatic target recognition
(ATR) systems need fast and effective discriminators to
suppress vast amounts of natural clutter from, while admitting
the much more limited set of man-made object
data to, their classification processors. Recent research,
using mm-wave SAR data, has demonstrated
that multiresolution processing offers a useful discriminant
in this regard.^ Other work, with ultra-wide-band
foliage-penetrating SAR data, has shown that adaptive-resolution
imaging can exploit the aspect-dependent reflectivity
of man-made objects.^ Neither these studies, nor
other related work,'^ have taken a principled approach,
one based on the physical characteristics of the reflecting
surfaces and SAR operation, to multiresolution SAR image
formation and optimal target detection. The present
paper is a first step toward such a fundamental assessment.

Using the physical optics formalism established in Park
and Shapiro,^ we consider multiresolution SAR image
formation for a 1-D SAR, i.e., a continuous-wave downlooking
sensor. We find that the carrier-to-noise ratios
(CNRs) for diffuse and specular reflectors have different
multiresolution signatures. Thus, although a diffuse reflector
and a specular reflector of the same size have identical
normalized CNRs when their SAR returns are processed
over the full dwell time, these reflectors show substantially
different behavior when processed over shorter
time intervals. In particular, the “broad-side flash” phenomenon
exploited by Chaney et al.^ is clearly present
in our specular CNR analysis. This specular CNR behavior
directly impacts the structure and performance of
the Neyman-Pearson optimal detector for such a reflector:
for extended specular targets the optimum detection processor
substantially outperforms the conventional, full-resolution
SAR imager.

2. 1-D SAR PRINCIPLES

The essential principles that permit multiresolution imaging
to distinguish specular returns from diffuse returns
can  be  gleaned  from  simple  physical  arguments,  as  we 
now  show. 

2.1.  Doppler  Pulse-Compression  Imaging 

Consider the simple continuous-wave (cw) downlooking
radar imager sketched in Fig. 1. A plane flying at speed
$v$ and height $L$ above the ground collects returns, i.e.,
reflections, from the scatterers it irradiates. Because of
the plane's motion, the Doppler-shift time history associated
with the return from a point scatterer located
at $\vec{r}_s$ is a linear function (a frequency chirp) of rate
$\dot{\nu}_D = -2v^2/\lambda L$, with zero-intercept at $x/v$, where $x$ is
the along-track component of $\vec{r}_s$, and $\lambda$ is the radar's
operating wavelength. The time duration of the chirp
from this single scatterer is the dwell time, i.e., the length
of time during which the particular scatterer lies within
the radar's transmitter beam, given by $T \approx \lambda L/vd$ in
diffraction-limited far-field operation with an antenna of
diameter $d$. By analogy with pulse compression operation
of an angle-angle imager, we can show that matched-filter
(chirp-compression) processing of the return from this single
point scatterer will yield a time-domain output waveform
that is centered at $x/v$ and has a nominal width
$x_{res}/v$, where


$$x_{res} \equiv \frac{v}{|\dot{\nu}_D|\,T} = \frac{\lambda L}{2vT} \approx \frac{d}{2} \qquad (1)$$


gives  the  along-track  spatial  resolution  of  the  system. 
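
The relations among chirp rate, dwell time, and along-track resolution can be sketched numerically. The platform speed, wavelength, altitude, and antenna diameter below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def chirp_rate(v, wavelength, height):
    """Doppler chirp rate -2 v^2 / (lambda L), in Hz/s."""
    return -2.0 * v ** 2 / (wavelength * height)

def dwell_time(v, wavelength, height, diameter):
    """Approximate time a scatterer spends in the beam, lambda L / (v d)."""
    return wavelength * height / (v * diameter)

def along_track_resolution(v, wavelength, height, processing_time):
    """Resolution v / (|chirp rate| * processing time)."""
    return v / (abs(chirp_rate(v, wavelength, height)) * processing_time)

# Hypothetical system: 100 m/s platform, 3 cm wavelength, 5 km altitude, 1 m antenna.
v, wavelength, height, diameter = 100.0, 0.03, 5000.0, 1.0

T = dwell_time(v, wavelength, height, diameter)
full = along_track_resolution(v, wavelength, height, T)
half = along_track_resolution(v, wavelength, height, 0.5 * T)
print(T, full, half)
```

Processing over the full dwell time gives a resolution of half the antenna diameter, and halving the processing time doubles the resolution cell, which is the reciprocal dependence that multiresolution processing relies on for diffuse scatterers.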
[Figure 1 labels: synthetic aperture $vT$; physical aperture $d$.]


Figure  1.  Geometry  for  a  1-D  continuous-wave  synthetic 
aperture  radar. 

2.2.  Diffuse  versus  Specular  Reflections 

The simple, point-scatterer description we have given for
the  resolution  capability  of  a  1-D  SAR  applies  directly  to 
returns  from  an  extended  rough  surface,  i.e.,  to  a  diffuse 
reflector,  but  it  does  not  apply  directly  to  the  returns  from 
an  extended  smooth  surface,  i.e.,  a  specular  reflector.  As 
sketched  in  Figs.  2  and  3,  the  returns  from  each  point 
on  a  rough  surface  add  incoherently,  whereas  the  returns 
from  each  point  on  a  smooth  surface  add  coherently.  In 
particular,  the  rough  surface  shown  in  Fig.  2  will  produce 
a  total  return  whose  time  duration  equals  the  time  this 
surface  spends  within  the  radar’s  transmitter  beam.  On 
the  other  hand,  the  smooth  surface  shown  in  Fig.  3  will 
produce  a  total  return  whose  time  duration  is  determined 
by  the  time  it  takes  the  radar’s  receiver  antenna  to  move 
through  the  diffraction  pattern  created  by  the  reflected 
wave.  For  a  large  specular  target  this  time  duration  will 
be  much  less  than  the  dwell  time  T  given  in  the  previous 
section.  Furthermore,  because  of  Snell’s  law,  the  tilt  of 
such  a  specular  target  will  lead  to  a  time  offset  in  the 
return  field  arriving  at  the  radar  receiver. 

3.  SYSTEM  MODEL  AND  CNR 

In  order  to  quantify  the  phenomena  described  qualita¬ 
tively  in  the  preceding  section,  we  shall  extend  the  phys¬ 
ical  optics  1-D  SAR  analysis  of  Park  and  Shapiro^  to 
include  specular,  as  well  as  diffuse,  reflectors,  and  mul¬ 
tiresolution,  as  well  as  full-resolution  imaging. 


[Figure labels: incident field; highly directional.]


Figure  2.  Rough  surfaces  give  diffuse  reflections. 


INCIDENT  FIELD 


Figure  3.  Smooth  surfaces  give  specular  reflections. 

3.1.  IF  Signal  Model 

The intermediate-frequency (IF) signal in the SAR receiver
can be taken to have complex envelope $r(t) = y(t) + w(t)$,
where $y(t)$ represents the return waveform
and $w(t)$ is a zero-mean circulo-complex, white Gaussian
receiver noise with spectral density $N_0$. In a convenient
normalization, and assuming a multiplicative target
model,^’® the return complex envelope can be written in
the following form:

$$y(t) = \sqrt{P_T}\int d\rho\; T(\rho)\,\xi_L^2(\rho - \vec{v}t), \qquad (2)$$

where $P_T$ is the transmitter power, $T(\rho)$ is the field-reflection
coefficient vs. the transverse coordinate vector
$\rho = (x, y)$ on the $z = 0$ reference plane, and $\xi_L(\rho)$ is
the normalized transmitter field pattern on this reference
plane. In this equation, we have assumed that the radar's
transmitter and receiver use the same antenna pattern,
and that lag angle compensation has been performed. We
shall assume that $\xi_L$ results from $L$-m far-field free-space
propagation  of  an  elliptical  Gaussian  beam  whose  major 
and  minor  axes  coincide  with  the  across-track  and  along- 
track  directions,  respectively.  The  choice  of  a  Gaussian 
pattern  is  for  analytical  convenience;  the  use  of  an  ellipti¬ 
cal  pattern  allows  the  across-track  resolution  to  gain  the 
superior  resolution  of  a  larger  aperture  dimension  while 
the along-track resolution benefits from synthetic aperture
processing.^ Our model for a simple extended diffuse
reflector is statistical: $T(\rho)$ is a zero-mean, circulo-complex
Gaussian random field with a $\delta$-function covariance,

$$\langle T(\rho)\,T^*(\rho')\rangle = \lambda^2\rho_d \exp(-2|\rho|^2/a^2)\,\delta(\rho - \rho'), \qquad (3)$$

where $\rho_d$ is the diffuse reflectivity and $a$ is the radius
of the reflecting region. Our corresponding model for a
plane mirror of the same nominal size is essentially deterministic:

$$T(\rho) = \sqrt{\mathcal{T}_s}\,\exp(-|\rho|^2/a^2 + j2k\theta\cdot\rho + j\phi_s), \qquad (4)$$

where $\mathcal{T}_s$ is the intensity reflectivity, $\theta$ is the tilt, $\phi_s$ is the
phase shift of the mirror, and $k = 2\pi/\lambda$ is the wave number
at the radar wavelength. Because we seldom know the
radar-to-ground distance to a small fraction of a wavelength,
we shall assume that $\phi_s$ is a random variable that
is uniformly distributed on $[0, 2\pi)$; the other parameters
in our specular model are non-random. We have used
Gaussian  shapes  in  our  diffuse  and  specular  models  be¬ 
cause  they  will  allow  closed-form  solutions  for  multires¬ 
olution  CNR  and  other  calculations  while  preserving  the 
essential  physical  parameter  dependencies  of  more  realis¬ 
tic  shape  functions. 

3.2.  Multiresolution  Carrier-to-Noise  Ratio 

The  architecture  of  a  simple  1-D  SAR  receiver  is  shown 
in Fig. 4. It consists of a chirp-compression filter, whose
integration interval has zero time-offset and duration
$T_c$, followed by a video detector. The carrier-to-noise ratio
of this receiver, $\mathrm{CNR}(t; T_c)$, is defined to be the mean
target-return power in the cw-SAR image at time $t$ divided
by the mean receiver-noise power in the cw-SAR
image  at  time  t.  We  are  interested  in  seeing  whether 
or  not  the  qualitative  differences  between  the  diffuse  and 
specular  reflectors,  described  earlier,  are  indeed  present 
in  the  IF  signal  model  we  have  posited.  Specifically,  does 
CNR(t;  Tc)  behave  differently,  for  the  diffuse  and  specu¬ 
lar  reflectors,  as  the  processing  time  Tc  is  increased  from 
very  small  values  up  to  the  full  dwell  time  T? 

Using the model given in the previous section, we find
that the diffuse target's CNR obeys

$$\mathrm{CNR}(t; T_c) = K_d\,\rho_d\,T_c \exp\left[-\frac{(vt)^2}{\Delta x_d^2}\right], \qquad (5)$$

and the specular target's CNR is given by

$$\mathrm{CNR}(t; T_c) = K_s\,\frac{k^2 a^2\,\mathcal{T}_s\,T_c}{[\text{illegible}]} \exp\left[-\frac{(vt - x_s)^2}{\Delta x_s^2}\right], \qquad (6)$$

where $K_d$ and $K_s$ are constants that do not depend on
the processing time $T_c$. These equations show that the
Doppler-compressed diffuse and specular returns have
CNRs with different spatial extents $\Delta x_d$ and $\Delta x_s$, respectively,
and that the Doppler-compressed specular return
has a spatial offset $x_s$ that the corresponding diffuse
case does not. These forms are indicative of the signatures
we had hoped for: different resolutions ($t$-widths
of $\mathrm{CNR}(t; T_c)$) for the specular and diffuse returns arising
from the coherent nature of the specular reflection
process, and an offset for the specular return that is due
to  Snell’s  law.  Figures  5  and  6  show  that  the  desired 
signatures  do,  in  fact,  exist.  In  drawing  these  figures 
we  have  assumed  that:  the  radar’s  receiver  is  in  the  far 
field  of  the  specular  target;  the  radar’s  integration  inter¬ 
val  is  long  compared  to  the  aperture-translation  time  and 
short  compared  to  the  target-dwell  time;  and  the  target 
is  much  larger  than  the  along-track  aperture  size  of  the 
radar  transmitter. 

Figure  5  shows  that  the  diffuse-target  resolution  follows 


Figure  5.  CNR  comparison:  specular  vs.  diffuse  target 
resolution. 


Figure 6. CNR comparison: specular target angular shift.

Figure 4. Block diagram of 1-D SAR receiver.




a  straight  reciprocal  dependence  on  the  processing  time. 
This  is  to  be  expected.  We  have  assumed,  in  Fig.  5,  that 
Tc  is  large  enough  to  form  a  significant  synthetic  aper¬ 
ture,  but  small  enough  that  the  full,  dwell-time  limited 
synthetic  aperture  is  not  used.  The  more  interesting  part 
of Fig. 5 is the specular target behavior. At short processing
times the resolution is limited by the target's size,
and independent of $T_c$. Ultimately, the synthetic aperture
effect  takes  over,  and  the  specular  curve  merges  with  the 
diffuse  curve.  For  our  equal-sized  specular  and  diffuse  tar¬ 
gets,  this  means  their  full-resolution  (dwell-time  limited 
processing)  images  will  have  the  same  Ax  values.  Fig¬ 
ure  6  shows  the  aspect  dependence  of  the  specular  return. 
Here  we  see  that  geometric  optics  (Snell’s  law)  behavior 
prevails  at  short  processing  times,  but  the  tilt  dependent 
shift  in  the  SAR  image  disappears  as  the  processing  time 
is  increased. 

4.  TARGET  DETECTION  THEORY 

Our  final  task  will  be  to  show  that  the  multiresolution  sig¬ 
natures  demonstrated  in  the  last  section  have  important 
implications for target detection. We consider the following
idealized binary hypothesis testing problem. Under
hypothesis $H_0$, the normalized IF complex envelope satisfies

$$r(t) = y_d(t) + w(t), \qquad (7)$$

where $y_d(t)$ is the return from a statistically homogeneous
diffuse clutter, i.e., $y_d(t)$ is given by Eq. 2 for a diffuse
reflector whose covariance function is found from Eq. 3
with $a \to \infty$. Under hypothesis $H_1$, we have that

$$r(t) = y_s(t) + y_d(t) + w(t), \qquad (8)$$

where $y_s(t)$ satisfies Eq. 2 with $T(\rho)$ from Eq. 4. Under
either hypothesis, $w(t)$ is white Gaussian receiver noise.

The optimum Neyman-Pearson detection processor for
this problem processes $r(t)$ to obtain maximum detection
probability $P_D = \Pr(\text{say } H_1 \mid H_1 \text{ true})$ for a prescribed
value of the false-alarm probability $P_F = \Pr(\text{say } H_1 \mid H_0 \text{ true})$.
Note that we will assume that this processor
knows all the parameters of the diffuse and specular
reflectors.  (This  same  assumption  will  be  made  for  the 
two  suboptimum  receivers  described  below.)  In  a  real 
detection  scenario,  many  of  these  parameters  may  be  un¬ 
known  and  will  have  to  be  estimated  from  the  data.  Our 
desire  here  is  to  show  that  the  multiresolution  processor 
can  offer  significant  performance  improvement  over  a  con¬ 
ventional,  full-resolution  SAR  imager.  If  this  is  not  the 
case  in  an  idealized,  known-parameter  setting,  it  seems 
unlikely  that  any  such  advantage  could  prevail  in  more 
realistic  situations. 


[Block diagram labels: radar return + receiver noise; matched-filter/video-detector; threshold comparison; detection decision.]

Figure  7.  Optimum  Neyman-Pearson  receiver. 


Figure  8.  1-D  SAR  imager  receivers. 


The  optimum  detection  problem  we  have  posed  is  well 
known.®  The  structure  of  the  optimum  receiver  is  shown 
in  Fig.  7;  it  consists  of  a  whitening  filter  followed  by  a 
matched-filter/video-detector and then a threshold test.
There  are  two  other  receivers  whose  performance  we  shall 
compare  to  that  of  the  optimum  system:  the  conven¬ 
tional,  full-resolution  1-D  SAR  imager,  and  the  optimized 
multiresolution  1-D  SAR  imager.  These  receivers  share  a 
common  block  diagram,  shown  in  Fig.  8.  The  difference 
between these two receivers is as follows. The conventional
imager uses no integration-interval offset in its
chirp  compressor.  In  other  words,  it  makes  no  attempt  to 
accommodate  the  Snell’s  law  shift  in  the  time  at  which 
the  specular-target  return  occurs.  Also,  the  conventional 
imager  uses  the  full  dwell  time  for  its  chirp  compression 
integration;  it  does  not  try  to  exploit  the  multiresolu¬ 
tion  specular  signature  we  demonstrated  in  the  previ¬ 
ous  section.  In  contrast,  the  optimized  multiresolution 
imager chooses its integration-interval offset and duration
to maximize the resulting $P_D$ at the prescribed $P_F$.
This  receiver  represents  an  idealized,  known-parameter 
1-D  SAR  version  of  the  aspect-dependent  processor  re¬ 
ported  by  Chaney  et  al.^ 

The receiver operating characteristics ($P_D$ vs. $P_F$
behaviors) of the preceding three receivers share a common
form,

$$P_D = Q\left(d, \sqrt{-2 \ln P_F}\right), \qquad (9)$$

where

$$Q(\alpha, \beta) = \int_{\beta}^{\infty} dz\, z \exp(-\alpha^2/2 - z^2/2)\, I_0(\alpha z) \qquad (10)$$

is Marcum's Q function, and $I_0$ is the zeroth-order modified
Bessel function. In Eq. 9, $d^2$ is the effective signal-to-noise
ratio (SNR), which, in general, has a different value
for each of our three receivers. We have plotted Eq. 9 for
several $d$ values in Fig. 9. The rest of our analysis shall
concentrate on the behavior of the effective SNRs.
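The receiver operating characteristic of Eq. 9 can be evaluated numerically. The following is a minimal sketch, not the authors' code: it integrates the definition of Marcum's Q function with the trapezoidal rule and computes $I_0$ from its power series. The function names, the integration cutoff, and the step count are assumptions chosen for illustration, and the routine is only intended for moderate values of $d$ (roughly $d \le 10$).

```python
import math


def bessel_i0(x):
    """Zeroth-order modified Bessel function, summed from its power series."""
    q = (x * x) / 4.0
    term = 1.0
    total = 1.0
    k = 1
    while term > 1e-17 * total:
        term *= q / (k * k)  # ratio of consecutive series terms
        total += term
        k += 1
    return total


def marcum_q(alpha, beta, steps=4000):
    """Q(alpha, beta): integral over z from beta to infinity of
    z * exp(-(alpha^2 + z^2) / 2) * I0(alpha * z), by the trapezoidal rule."""
    upper = max(alpha, beta) + 12.0  # assumed cutoff; the integrand is negligible beyond it
    h = (upper - beta) / steps

    def integrand(z):
        return z * math.exp(-(alpha * alpha + z * z) / 2.0) * bessel_i0(alpha * z)

    total = 0.5 * (integrand(beta) + integrand(upper))
    for i in range(1, steps):
        total += integrand(beta + i * h)
    return total * h


def detection_probability(d, p_f):
    """Eq. 9: P_D = Q(d, sqrt(-2 ln P_F))."""
    return marcum_q(d, math.sqrt(-2.0 * math.log(p_f)))
```

A useful sanity check is the zero-SNR case: with $d = 0$ the integral reduces to $\exp(-\beta^2/2)$, so the detection probability equals the false alarm probability.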

Let us use $d_o^2$, $d_c^2$, and $d_w^2$ to denote the effective SNRs
for the optimum receiver, the conventional full-resolution
1-D SAR receiver, and the optimum white-noise receiver,
respectively. The optimum white-noise receiver achieves
maximum $P_D$ at the allowed $P_F$ for our binary hypothesis
test when there is no diffuse clutter present under
either hypothesis. It turns out that this receiver takes
the form of an optimized multiresolution SAR imager in
the absence of clutter, and is very nearly the optimized
multiresolution SAR in the presence of clutter.

Figure 9. Marcum's Q function

We first consider the case in which tilt is not an issue,
i.e., the tilt angle is zero in our specular target model.
Figures 10 and 11 show log-log plots of the optimum
receiver's effective SNR advantages ($d_o/d_w$ and $d_o/d_c$),
as a function of normalized target size ($a/d$, with $d$ being
the radar's along-track antenna diameter), for two values
of the diffuse-return (clutter) carrier-to-noise ratio,
$\mathrm{CNR}_d$.

By definition, we have $d_w \to d_o$ as $\mathrm{CNR}_d \to 0$. Figure 10
shows that for extended specular targets, i.e., for
$2a/d > 1$, this equivalence prevails even when the clutter
is strong. Physically, the large specular target presents
a shorter-than-dwell-time return to the radar receiver.
Hence, this return has a narrower bandwidth (less frequency
chirp) than the clutter return. The performance
of the white-noise receiver approximates that of the optimum
receiver because the clutter spectrum is nearly flat
over the bandwidth of the return from the extended specular,
and hence the whitening filter in the optimum receiver
is superfluous. Figure 11 shows that the optimum
receiver has many decibels of effective SNR advantage
over the conventional receiver for a large specular target
($a/d \gg 1$). The conventional receiver collects noise
over the full chirp bandwidth of the dwell time, and this
extra noise drives its SNR down, relative to that of the
optimum receiver, because the optimum receiver uses the
much narrower bandwidth associated with the more limited
chirp present on the specular target return.

Figure 10. Effective SNR comparison: optimum receiver
vs. white-noise receiver at zero target tilt. The
$\mathrm{CNR}_d = 0.1$ curve is indistinguishable from the $d_o = d_w$
line. [Horizontal axis: log(2a/d).]

Figure 11. Effective SNR comparison: optimum receiver
vs. conventional receiver at zero target tilt.

Figure 12. Effective SNR comparison: optimum receiver
vs. conventional receiver as a function of along-track
target tilt.

Our final example addresses the impact of target tilt.
In Fig. 12 we have plotted $d_o/d_c$ vs. normalized along-track
tilt at $2a/d = 10$ for two values of $\mathrm{CNR}_d$. Note
that $d_o \approx d_w$ prevails at this value of $2a/d$, so this figure
also constitutes a comparison between conventional and
optimized multiresolution SAR imagers. These curves underline
the value of exploiting the aspect-dependence of
the return from a large specular target.

5.  CONCLUSIONS 

Our  continuing  objective  is  to  develop  a  principled  ap¬ 
proach  to  the  use  of  multiresolution  image  formation  for 
discriminating  specular  returns  from  diffuse  returns  in 
synthetic aperture radar data. In this, our initial effort,
we  have  used  a  simple  cw  1-D  SAR  model  to  establish 
the  fundamental  validity  of  using  multiresolution,  aspect- 
dependent  specular  target  effects  for  the  discrimination 
task. Complete derivations of our results are given in Leung,
where the more general specular-reflector case of an
elliptically-symmetric  curved  mirror  is  considered.  This 
source  also  includes  target  models  and  CNR  behaviors  for 
dihedral  and  trihedral  reflectors,  as  well  as  a  comparison 
of  the  structure  and  performance  of  optimum  Neyman- 
Pearson,  conventional  1-D  SAR,  and  optimized  multires¬ 
olution  SAR  receivers  for  the  detection  of  a  finite  diffuse 
target embedded in white Gaussian receiver noise. Our
current work includes the extension of our formalism to
2-D stripmap operation, to polarimetric SAR, and to the
detection and recognition of multicomponent targets.

Abstract

This  paper  introduces  techniques  for 
context-aided  false  alarm  reduction  for  Au¬ 
tomatic  Target  Recognition  (ATR)  in  air¬ 
borne  Synthetic  Aperture  Radar  (SAR) 
images.  Candidate  target  chips  are  iden¬ 
tified  using  Constant  False  Alarm  Rate 
(CFAR)  detection  techniques.  If  only  a  sin¬ 
gle  image  of  a  site  is  available,  a  2-D  site 
model  is  constructed  and  used  to  determine 
regions  inhospitable  to  targets.  A  frame¬ 
work  for  the  registration  and  exploitation 
of  multipass  imagery  is  developed.  After 
registration,  these  images  are  used  to  pro¬ 
vide  consistency-based  false  alarm  reduc¬ 
tion.  A  final  discrimination  step  separates 
surviving  false  alarms  from  targets.  Ex¬ 
perimental  results  using  the  TESAR  and 
ADTS  datasets  are  included. 

1  Introduction 

A  high-resolution  airborne  radar  operating  at  non¬ 
foliage  penetrating  frequencies  and  moderate  depres¬ 
sion  angles  can  produce  images  of  targets,  in  the 
clear  or  under  partial  occlusion.  Bright  scatterers  in 
radar  imagery  due  to  targets  can  be  detected  using 
CFAR  processing  [Rohling,  1983],  where  each  pixel  is 
compared  to  an  adaptive  threshold  that  is  a  function 
of  the  desired  probability  of  false  alarm,  a  statistic 
derived  from  the  reference  clutter  window  around 
the  pixel,  and  the  size  of  this  reference  window. 
Cardinal streaks and other bright returns from
buildings and associated rooftop substructures and
false alarms from foliage increase the burden on ATR
algorithms. This paper describes several techniques
for false alarm reduction using context, prior to discrimination,
for SAR ATR. The context information
could be derived from a single image using segmentation,
or from registered multipass imagery. Given
a single radar image of a site and the result of a target
detection algorithm, it is possible to focus the
attention of ATR algorithms by delineating buildings,
dense foliage, and other areas where targets
are not expected to be found. Alternatively, if multipass
imagery of a site is available, false alarms can
be eliminated by checking for consistency in detector
output across images. Although different statistical
models for the clutter lead to different CFAR detectors,
the algorithms presented here are independent
of the type of detector chosen.

... Research and the Defense Advanced Research Projects
Agency (ARPA Order No. A369) under Grant F49620-...
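To make the adaptive-threshold idea concrete, here is a minimal one-dimensional sketch of an order-statistic CFAR test in the spirit of [Rohling, 1983]. It is an illustration under stated assumptions, not the detector used in the paper: clutter intensity is assumed exponentially distributed (the square of a Rayleigh magnitude), the window sizes and rank fraction are arbitrary, and all function and parameter names are hypothetical. For that clutter model the false alarm probability of the test "cell > T times the k-th smallest of N reference cells" is the product of (N - i)/(N - i + T) for i = 0, ..., k - 1, which the code inverts by bisection.

```python
def os_cfar_pfa(scale, n_ref, k):
    """False alarm probability of OS-CFAR for exponentially distributed intensity."""
    p = 1.0
    for i in range(k):
        p *= (n_ref - i) / (n_ref - i + scale)
    return p


def os_cfar_scale(pfa, n_ref, k):
    """Threshold multiplier T that achieves the requested false alarm probability."""
    lo, hi = 0.0, 1.0
    while os_cfar_pfa(hi, n_ref, k) > pfa:  # grow the bracket until it contains T
        hi *= 2.0
    for _ in range(100):  # bisection: the probability decreases as T grows
        mid = 0.5 * (lo + hi)
        if os_cfar_pfa(mid, n_ref, k) > pfa:
            lo = mid
        else:
            hi = mid
    return hi


def os_cfar_detect(intensity, pfa, half_window=8, guard=2, rank_fraction=0.75):
    """Return the indices whose intensity exceeds the adaptive OS-CFAR threshold."""
    n = len(intensity)
    scale_cache = {}
    detections = []
    for s in range(n):
        lo = max(0, s - guard - half_window)
        hi = min(n, s + guard + half_window + 1)
        # Reference cells: the window around s, excluding the guard band and s itself.
        ref = sorted(intensity[j] for j in range(lo, hi) if abs(j - s) > guard)
        if not ref:
            continue
        k = max(1, int(round(rank_fraction * len(ref))))
        key = (len(ref), k)
        if key not in scale_cache:
            scale_cache[key] = os_cfar_scale(pfa, len(ref), k)
        if intensity[s] > scale_cache[key] * ref[k - 1]:
            detections.append(s)
    return detections
```

The threshold depends on the three quantities named in the text: the desired false alarm probability, a statistic of the reference window (here the k-th order statistic), and the window size.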

In  Section  2,  we  revisit  the  problem  of  deriving  con¬ 
text  from  a  single  SAR  image  by  constructing  a  2- 
D  site  model.  A  multiresolution  technique  for  seg¬ 
menting  SAR  intensity  images  under  Weibull  clutter 
assumptions  is  presented.  A  maximum-likelihood 
segmentation  is  performed  at  the  coarsest  resolution 
and  the  region  labels  and  a  confidence  measure  are 
propagated  to  finer  resolutions.  Only  pixels  with  low 
confidence  are  reclassified,  yielding  a  smoother  seg¬ 
mentation  map  with  a  smaller  computational  bur¬ 
den. 

Two  SAR  images  of  the  same  area  are  projections 
of  the  three  dimensional  (3-D)  scene  onto  different 
slant  planes.  Hence,  a  Euclidean  or  similarity  trans¬ 
formation  is  not  sufficient  to  register  the  two  images. 
In  Section  3,  we  show  that  registration  of  two  SAR 
images  can  be  approximated  by  an  affine  transforma¬ 
tion  which  consists  of  the  following  sequential  steps: 
projection  of  the  first  image  to  the  ground  plane;  ro¬ 
tation  and  translation  within  the  ground  plane;  and 
projection  to  the  slant  plane  of  the  second  image. 
This  transformation  can  be  derived  from  the  sensor 
acquisition parameters, with a post-registration refinement
for translational errors. One application
for exploiting registered SAR imagery, namely computing
heights of buildings and the foliage canopy,
is developed and applied to real multi-pass airborne
SAR imagery from MIT Lincoln Laboratory.

Section  4  describes  schemes  for  false  alarm  reduction 
in  single  and  multipass  SAR  imagery  with  examples. 
After  context-aided  false  alarm  reduction,  a  discrim¬ 
ination  step,  described  in  Section  5,  is  used  to  further 
narrow  down  the  candidate  targets.  Discrimination 
is  performed  with  the  help  of  size  and  shape  features 
derived  from  the  target  chips. 

2  Multiresolution  Segmentation  and 
Region  Labeling 

In  earlier  work  [Chellappa  et  al.,  1996;  Kuttikkad 
and  Chellappa,  1995;  Kuttikkad  et  al.,  1996],  we 
have  described  techniques  for  generating  2-D  site 
models  from  single  and  multipolarization  SAR  im¬ 
agery.  Our  site  model  consists  of  delineations  of 
buildings,  roads,  and  possible  target  clusters  (in¬ 
cluding  military  and  civilian  vehicles),  as  well  as 
natural  regions  such  as  trees,  shadows,  grass,  etc. 
The algorithm consists of three stages: bright pixel
detection, segmentation, and labeling/recognition.
Bright  scatterers  in  nonhomogeneous  clutter  are  de¬ 
tected  using  CFAR  processing.  Next,  a  maximum 
likelihood  (ML)  segmentation  into  a  small  number 
of  expected  terrain  classes  is  performed  to  label  the 
image.  The  conditional  distribution  of  the  backscat- 
ter,  given  the  region  label,  is  chosen  appropriately 
based  on  the  type  of  data  available  (single/multiple 
looks,  intensity/complex,  single/multipolarization, 
etc.).  At  typical  depression  angles,  a  building  pro¬ 
duces  a  bright  linear  or  L-shaped  streak  along  its 
edge(s)  facing  the  sensor,  followed  by  a  shadow  re¬ 
gion.  Bright  pixel  clusters  are  identified  in  the  CFAR 
output  and  checked  for  elongated  streaks  by  fitting 
rectangles  of  a  minimum  aspect  ratio  and  length, 
and  testing  for  shape  conformity.  Detected  streaks 
are  then  combined  with  shadow  information  from 
the  ML  labeling  stage  to  identify  buildings.  Shape 
constraints  are  used  to  aggregate  road  regions  into 
road  segments.  Finally,  since  the  operating  fre¬ 
quency  of  the  sensor  does  not  permit  foliage  pen¬ 
etration,  tree/grass  ambiguities  are  resolved  using 
supporting  shadow  evidence. 

To  improve  the  critical  ML  segmentation  stage  of 
the  site  model  construction  algorithms,  we  have  de¬ 
veloped  a  multiresolution  technique  for  segmenting 
single-polarization  SAR  intensity  images.  The  mo¬ 
tivation  for  the  development  of  a  multiresolution  al¬ 
gorithm  is  smoother,  more  accurate  segmentation  to 
facilitate context extraction and region-adaptive detection.
For instance, existing segmentation algorithms
often misclassify pixels at tree-grass boundaries
in high-resolution imagery. These misclassifications
occur because tree edges are very similar to
grassy regions and are interspersed with shadows
that lower the first-order statistics into a range
closely matching the grass class. The effect is usually
a ring of incorrectly labeled grass pixels surrounding
regions of tree canopy.

Most  of  the  previous  multiresolution  work  (such  as 
[Krishnamachari  and  Chellappa,  1997]  and  its  ref¬ 
erences)  is  designed  for  texture  segmentation  and  is 
not  appropriate  for  the  speckled  images  produced  by 
SAR. Fosgate et al. [Fosgate et al., 1997] developed a SAR
multiresolution segmentation algorithm for two-class
(tree/grass) segmentation of complex data. Linear
prediction  across  scales  is  used  to  classify  the  fine- 
resolution  pixels.  An  ML  technique  based  on  the 
distributions  of  the  prediction  error  residuals  has 
been  effectively  used  to  solve  the  two-class  prob¬ 
lem.  A  problem  with  this  technique  is  the  refine¬ 
ment  procedure  needed  to  obtain  accurate  labeling 
at  boundaries  due  to  the  large  window  sizes  used  for 
classification.  Segmentation  into  a  larger  number  of 
classes  would  accentuate  the  problem  by  increasing 
the  number  of  region  boundaries. 

Our  multiresolution  segmentation  algorithm  utilizes 
the  Weibull  clutter  model  with  a  formulation  simi¬ 
lar  to  that  found  in  [Krishnamachari  and  Chellappa, 
1997].  The  Weibull  clutter  model  used  for  CFAR  de¬ 
tection  has  proved  reasonably  accurate  and  has  some 
characteristics  similar  to  the  K-distribution  that  has 
been  found  to  be  a  good  model  for  ADTS  clutter 
[Yueh  et  al,  1989].  The  Weibull  model  has  an  ad¬ 
vantage  over  the  K  model  in  that  the  log-likelihood 
function  is  simpler,  resulting  in  faster  implementa¬ 
tion.  Training  is  also  more  straightforward  because 
an  iterative  ML  algorithm  for  Weibull  parameter 
estimation  is  available  whereas  a  two-dimensional 
search  through  parameter  space  must  be  performed 
to  find  the  ML  estimate  of  the  K  parameters.  An¬ 
other  advantage  of  the  Weibull  model  is  that  it  is 
appropriate  for  both  magnitude  and  intensity  im¬ 
ages  due  to  the  fact  that  squaring  a  Weibull  random 
variable  produces  another  Weibull  random  variable. 
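The closure of the Weibull family under squaring follows directly from the cumulative distribution function. As a short worked step, write the Weibull distribution with scale $B$ and shape $C$ (symbol names assumed here for illustration):

```latex
\begin{align*}
P(X \le x) &= 1 - \exp\!\left[-\left(\frac{x}{B}\right)^{C}\right], \qquad x \ge 0, \\
P(X^2 \le y) &= P\!\left(X \le \sqrt{y}\right)
  = 1 - \exp\!\left[-\left(\frac{\sqrt{y}}{B}\right)^{C}\right]
  = 1 - \exp\!\left[-\left(\frac{y}{B^{2}}\right)^{C/2}\right].
\end{align*}
```

So if the magnitude is Weibull with scale $B$ and shape $C$, the intensity is Weibull with scale $B^2$ and shape $C/2$. The familiar special case is a Rayleigh magnitude ($C = 2$), whose square is exponentially distributed ($C = 1$).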

Let $X_s$ and $L_s$ be the observed intensity and true
label at the pixel location $s$, respectively. Under the
Weibull clutter assumption, the conditional distribution
of $X_s$ given the label $l$ can be expressed as

$$p(X_s \mid L_s = l) = \frac{C_l}{B_l}\left(\frac{X_s}{B_l}\right)^{C_l - 1} \exp\left[-\left(\frac{X_s}{B_l}\right)^{C_l}\right]$$

where the parameter set $(B_l, C_l)$ contains the scale
and shape parameters computed for class $l$ from the
training set. A sliding window is used to form the
joint log-likelihood function over a local neighborhood,
for each class. It is assumed that all the pixels
within the window have the same label. The pixel
under test is given the class label which maximizes
this joint log-likelihood function.

Figure 1: Results of multiresolution segmentation: (a) Original TESAR image, (b) single-resolution ML
segmentation (light gray=grass, dark gray=trees, black=shadows), (c) multiresolution ML segmentation, (d)
result of post-processing the multiresolution segmentation image with CFAR detections overlaid in white.

In the multiresolution formulation, we begin with a
quadtree with $M$ levels, where $k \in \{0, 1, \ldots, M-1\}$
denotes a level of the quadtree. The lowest level,
corresponding to $k = 0$, is the original fine-resolution
$N \times N$ SAR intensity image. At level $k$, the image
has been reduced to a coarser-resolution image of
size $N/2^k \times N/2^k$.

In  our  experiments,  we  tried  three  different  meth¬ 
ods  of  resolution  reduction— downsampling,  half¬ 
band  FIR  filtering,  and  averaging  the  four  child  pix¬ 
els.  Somewhat  surprisingly,  we  found  that  averaging 
the  intensity  values  produced  the  best  results.  We 
have  also  found  that  a  three-level  tree  works  well, 
with  little  improvement  obtained  by  adding  levels. 
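The averaging form of resolution reduction is simple to state in code. The sketch below is an illustration only: it builds the quadtree levels by averaging each group of four child pixels, with images held as plain lists of rows. The function names are hypothetical, and the image dimensions are assumed to be divisible by two at every level.

```python
def reduce_by_averaging(image):
    """Halve the resolution by averaging each 2x2 block of child pixels."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[2 * i][2 * j] + image[2 * i][2 * j + 1]
             + image[2 * i + 1][2 * j] + image[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(cols // 2)
        ]
        for i in range(rows // 2)
    ]


def build_quadtree(image, levels=3):
    """Return [level 0 (finest), level 1, ...], each level half the size of the last."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(reduce_by_averaging(pyramid[-1]))
    return pyramid
```

For example, a 4 x 4 image with three levels yields images of size 4 x 4, 2 x 2, and 1 x 1, and the single coarsest pixel equals the mean of the whole image.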

Segmentation begins at the coarsest level of the
quadtree. The resulting segmentation is denoted
by $\hat{L}^{(M-1)}$ and is of size $N/2^{M-1} \times N/2^{M-1}$.
Associated with $\hat{L}^{(M-1)}$ is the confidence measure,
$C^{(M-1)}$, defined as the
difference between the log-likelihood value of the ML
label and the log-likelihood of the runner-up label:

$$C_s^{(k)} = \ell_s^{ML} - \ell_s^{RU}$$

where $\ell_s^{ML}$ and $\ell_s^{RU}$ are the log-likelihoods associated
with the ML label and the runner-up label for
pixel $s$.

The segmentation at level $k - 1$, $\hat{L}^{(k-1)}$, is then initialized
using a zero-order hold in each dimension to
obtain

$$\hat{L}^{(k-1)}_{(i,j)} = \hat{L}^{(k)}_{(\lfloor i/2 \rfloor, \lfloor j/2 \rfloor)}.$$

The confidence measures are propagated in the same
way. At the new level, any pixel with confidence
measure below the threshold will be relabeled
based on estimates from the higher-resolution image.
We expect that at boundary regions the coarse-resolution
window will contain a mixture of regions
and have low confidence. Segmenting at the higher
resolution should produce a more confident labeling
in boundary regions. This process continues until
the finest scale is reached.
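The coarse-to-fine procedure can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes the Weibull density in its standard scale/shape form, uses the same class parameters at every level (a trained system would presumably fit them per level), and takes a hypothetical 3 x 3 window and confidence threshold. Image dimensions are assumed divisible by two at every level, and at least two classes are required.

```python
import math


def weibull_loglik(x, scale, shape):
    """Log of (shape/scale) * (x/scale)^(shape-1) * exp(-(x/scale)^shape)."""
    x = max(x, 1e-12)  # guard against log(0)
    return (math.log(shape / scale) + (shape - 1.0) * math.log(x / scale)
            - (x / scale) ** shape)


def classify_pixel(image, i, j, classes, half=1):
    """Return (ML label, confidence) from the joint log-likelihood over a window."""
    rows, cols = len(image), len(image[0])
    scores = []
    for label, (scale, shape) in classes.items():
        total = 0.0
        for u in range(max(0, i - half), min(rows, i + half + 1)):
            for v in range(max(0, j - half), min(cols, j + half + 1)):
                total += weibull_loglik(image[u][v], scale, shape)
        scores.append((total, label))
    scores.sort(reverse=True)
    # Confidence: ML log-likelihood minus the runner-up log-likelihood.
    return scores[0][1], scores[0][0] - scores[1][0]


def average_down(image):
    """Average each 2x2 block of child pixels."""
    return [[(image[2 * i][2 * j] + image[2 * i][2 * j + 1]
              + image[2 * i + 1][2 * j] + image[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(len(image[0]) // 2)]
            for i in range(len(image) // 2)]


def multires_segment(image, classes, levels=3, threshold=5.0):
    """Label at the coarsest level, then refine only low-confidence pixels."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(average_down(pyramid[-1]))
    coarse = pyramid[-1]
    labels = [[None] * len(coarse[0]) for _ in coarse]
    conf = [[0.0] * len(coarse[0]) for _ in coarse]
    for i in range(len(coarse)):
        for j in range(len(coarse[0])):
            labels[i][j], conf[i][j] = classify_pixel(coarse, i, j, classes)
    for level in range(levels - 2, -1, -1):
        img = pyramid[level]
        # Zero-order hold: each child inherits its parent's label and confidence.
        new_labels = [[labels[i // 2][j // 2] for j in range(len(img[0]))]
                      for i in range(len(img))]
        new_conf = [[conf[i // 2][j // 2] for j in range(len(img[0]))]
                    for i in range(len(img))]
        for i in range(len(img)):
            for j in range(len(img[0])):
                if new_conf[i][j] < threshold:  # relabel low-confidence pixels only
                    new_labels[i][j], new_conf[i][j] = classify_pixel(img, i, j, classes)
        labels, conf = new_labels, new_conf
    return labels
```

Because confident coarse labels are simply copied down, most fine-resolution pixels never need a window evaluation, which is where the reduced computational burden comes from.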

Figure  1  shows  an  example  of  TESAR  imagery  seg¬ 
mented  into  three  classes  (grass,  trees  and  shadows) 
using  the  multiresolution  segmentation  scheme.  Fig¬ 
ures  1(b)  and  (c)  contrast  the  results  of  ML  seg¬ 
mentation  at  the  finest  resolution  and  the  multires¬ 
olution  segmentation.  Finally,  Figure  1(d)  shows 
the  multiresolution  segmentation  after  some  post¬ 
processing  and  reintroduction  of  the  CFAR-detected 
targets.  Post-processing  involved  removal  of  small 
regions  and  looking  for  supporting  shadow  evidence 
for  tree  regions. 


Figure  2:  TESAR  image  with  treeline  extracted 
from  grass/tree  boundary  after  segmentation. 


2.1 Treeline extraction and
region-adaptive detection

One  method  of  utilizing  context  from  a  single  im¬ 
age  is  the  extraction  of  treelines  for  use  with  region- 
adaptive  target  detection  algorithms.  In  a  region- 
adaptive  detection  algorithm,  we  run  the  CFAR  de¬ 
tector  with  different  thresholds  in  different  regions. 
For  non-foliage-penetrating  radar,  one  could  run 
CFAR  with  a  low  false  alarm  rate  in  the  clear,  use 
a  different  false  alarm  rate  along  region  boundaries, 
and  omit  CFAR  processing  altogether  in  large  ho¬ 
mogeneous  forest  areas.  Lowering  the  threshold  in 
the  boundary  region  may  allow  detection  of  targets 
in  the  presence  of  tree  layover.  Thus,  extraction  of 
the  treeline  facing  the  sensor  becomes  crucial.  Par¬ 
tially  occluded  targets  along  the  trailing  treeline  can 
usually  be  detected  with  a  standard  CFAR  detec¬ 
tor  because  of  the  low  intensity  of  the  shadow  back¬ 
ground,  and  adjusting  the  CFAR  thresholds  along 
this  boundary  could  produce  a  large  number  of  false 
alarms  with  no  benefit  obtained.  Region  adaptive 
detection  and  treeline  extraction  are  applications 
where  the  smoothness  provided  by  our  multiresolu¬ 
tion  segmentation  is  important.  Without  a  smooth 
segmentation,  dominant  treelines  would  be  difficult 
to  detect  and  the  large  number  of  small  regions 
would  render  the  region-adaptive  algorithms  ineffec¬ 
tive. 

Figure  2  shows  an  original  TESAR  image  and  a  tree¬ 
line  extracted  by  the  multiresolution  segmentation 
algorithm.  After  extracting  tree/grass  boundary 
pixels, contours below a programmable size threshold
were eliminated. We do not currently have imagery
with good examples of layover, so our examples
are limited to boundary extraction at this time.


3  Registration  and  Exploitation  of 
Multipass  SAR  Imagery 

Assuming a flat earth, a large range-to-swath-width
ratio, and the availability of acquisition parameters,
the registration of two images acquired from an airborne
SAR can be approximated by an affine transformation
[Kuttikkad et al., 1997]. Thus, a 2-D
point, $[x^{(1)}\ r^{(1)}]^T$, in the first image can be transformed to
the corresponding point, $[x^{(2)}\ r^{(2)}]^T$, in the second image
via the transformation

$$\begin{bmatrix} x^{(2)} \\ r^{(2)} \end{bmatrix} = A \begin{bmatrix} x^{(1)} \\ r^{(1)} \end{bmatrix} + b \qquad (1)$$

where the $2 \times 2$ matrix $A$ depends on the acquisition
parameters and $b$ is a translation which can be determined
from GPS information. Here $\delta_x^{(i)}$, $\delta_r^{(i)}$ are the
respective pixel resolutions in the cross-range and
range dimensions, $\theta^{(i)}$ are the depression angles, and
$\phi = \phi^{(2)} - \phi^{(1)}$ is the difference in sensor headings
between the two images. For convenience, the affine
transformation of (1) can be written in the general
matrix formulation

$$\begin{bmatrix} x^{(2)} \\ r^{(2)} \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & b_1 \\ a_{21} & a_{22} & b_2 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x^{(1)} \\ r^{(1)} \\ 1 \end{bmatrix} \qquad (2)$$
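The three sequential steps described for this transformation (projection of the first image to the ground plane, rotation within the ground plane, projection to the second slant plane) can be composed into a single matrix. The sketch below is one plausible composition under a flat-earth assumption, in which a slant-range offset maps to a ground-range offset through division by the cosine of the depression angle. It is not the paper's own expression for the matrix; the rotation sign convention and all names are assumptions made for illustration.

```python
import math


def matmul2(p, q):
    """Product of two 2x2 matrices held as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def slant_to_slant_affine(dx1, dr1, theta1, dx2, dr2, theta2, heading_diff):
    """2x2 matrix taking (cross-range, range) pixel offsets in image 1 to image 2."""
    # Step 1: image-1 pixels to ground-plane distances (flat earth assumed).
    to_ground = [[dx1, 0.0], [0.0, dr1 / math.cos(theta1)]]
    # Step 2: rotate within the ground plane by the heading difference
    # (counter-clockwise positive is an assumed convention).
    c, s = math.cos(heading_diff), math.sin(heading_diff)
    rotate = [[c, -s], [s, c]]
    # Step 3: ground-plane distances to image-2 slant-plane pixels.
    to_slant = [[1.0 / dx2, 0.0], [0.0, math.cos(theta2) / dr2]]
    return matmul2(to_slant, matmul2(rotate, to_ground))


def affine3(a, b):
    """Homogeneous 3x3 form of the affine map with 2x2 part a and translation b."""
    return [[a[0][0], a[0][1], b[0]],
            [a[1][0], a[1][1], b[1]],
            [0.0, 0.0, 1.0]]


def apply_affine(a3, x, r):
    """Map the image-1 point (x, r) into image-2 coordinates."""
    return (a3[0][0] * x + a3[0][1] * r + a3[0][2],
            a3[1][0] * x + a3[1][1] * r + a3[1][2])
```

With identical acquisition geometry and no heading change the matrix reduces to the identity, so the registration is then a pure translation.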

In  order  to  compensate  for  errors  in  GPS-derived  lo¬ 
cation,  we  extract  a  number  of  point  features  from 
each  image  and  refine  the  translation  parameters 
from  them.  The  features  chosen  should  lie  in  (or 
near)  the  ground  plane,  so  that  there  are  no  layover 
effects  that  would  affect  different  views  differently. 
They  should  also  be  easy  to  detect  and  should  per¬ 
sist  across  images.  We  have  chosen  the  centroids  of 
clusters  of  bright  pixels  as  our  point  features.  These 
bright  returns  result  from  metallic  objects  and  other 
specular reflectors in the scene which may lie embedded
in non-homogeneous background clutter. In
the images we experimented with, they consist of
stationary vehicles and other strong reflectors, like
dihedrals and trihedrals. The Order Statistic CFAR
[Rohling, 1983] technique is used to detect bright
pixels in spatially-varying clutter. Terrain backscatter
is modeled as a complex Gaussian resulting in a
Rayleigh magnitude distribution. After initial registration,
distances between each feature point in one
image and all feature points in the other image are
computed. A search is then performed to find the
maximum number of one-to-one matches that result
in the same approximate translation.
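One simple way to realize such a search is to treat every pairing of a feature in one image with a feature in the other as a candidate translation and count how many one-to-one matches each candidate supports. The sketch below does this with a greedy nearest-neighbor assignment, which is an assumption made for brevity and is not guaranteed to find the true maximum matching. The tolerance, the function name, and the final averaging step are likewise hypothetical.

```python
import math


def refine_translation(points1, points2, tolerance=2.0):
    """points1: image-1 features after initial registration; points2: image-2 features.
    Returns (translation, matches), where matches are (index1, index2) pairs."""
    best_matches = []
    for x1, r1 in points1:
        for x2, r2 in points2:
            shift_x, shift_r = x2 - x1, r2 - r1  # candidate translation
            matches, used = [], set()
            for a, (xa, ra) in enumerate(points1):
                best_b, best_d = None, tolerance
                for b, (xb, rb) in enumerate(points2):
                    if b in used:
                        continue  # keep the matching one-to-one
                    dist = math.hypot(xa + shift_x - xb, ra + shift_r - rb)
                    if dist <= best_d:
                        best_b, best_d = b, dist
                if best_b is not None:
                    used.add(best_b)
                    matches.append((a, best_b))
            if len(matches) > len(best_matches):
                best_matches = matches
    if not best_matches:
        return (0.0, 0.0), []
    # Refine the translation as the mean displacement over the matched pairs.
    n = len(best_matches)
    mean_x = sum(points2[b][0] - points1[a][0] for a, b in best_matches) / n
    mean_r = sum(points2[b][1] - points1[a][1] for a, b in best_matches) / n
    return (mean_x, mean_r), best_matches
```

Features without a counterpart, such as a vehicle present in only one pass, simply fail to gather support and do not affect the estimate.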

An  example  of  registering  multipass  airborne  SAR 
data  is  illustrated  using  three  views  of  the  Stock- 
bridge  target  array  in  Figure  3.  The  alternate  views 
were  automatically  registered  with  the  reference  im¬ 
age  and  transformed  to  its  coordinate  system,  using 
our  affine  model  (bottom). 

3.1  Exploitation  of  registered 
imagery 

Co-registered  multi-pass  imagery  of  an  area  can  aug¬ 
ment  information  about  the  scene  and  resolve  ambi¬ 
guities.  We  present  some  applications  of  registered 
multi-pass  SAR  imagery  which  would  otherwise  be 
difficult  with  a  single  image.  Shadows  in  radar  im¬ 
ages  indicate  a  lack  of  backscatter  for  a  certain  du¬ 
ration  in  the  range  gating  window  immediately  be¬ 
hind  a  tall  object.  One  obvious  application  of  reg¬ 
istered  imagery  is  to  fill  in  the  missing  information 
in  shadow  regions.  The  missing  information  in  one 
image  can  be  supplemented  using  the  segmentation 
map  from  another  co-registered  image.  Another  use 
of  registered  multipass  imagery  is  object  height  es¬ 
timation  which  is  considered  next. 

3.2  Estimation  of  object  height 

Topography  reconstruction  from  a  stereo  pair  of 
SAR  images,  acquired  from  the  same  side,  opposite 
sides,  or  intersecting  flight  paths  has  been  demon¬ 
strated  ([Leberl,  1990],  chs.  13,14).  These  tech¬ 
niques  attempt  to  obtain  a  dense  height  map  from 
two  stereo  images.  Another  technique  for  recon¬ 
structing  terrain  heights  is  to  use  radar  interfer¬ 
ometry,  which  requires  two  coherently  acquired  im¬ 
ages  [Zebker  and  Goldstein,  1986].  We  consider  the 
problem  of  acquiring  heights  of  specific  structures 
from  a  pair  of  non-interferometric  SAR  images  col¬ 
lected  from  possibly  intersecting  flight  paths.  In  the 
case  of  high-resolution  airborne  SAR  imagery,  the 
presence  of  speckle  undermines  any  pixel-intensity- 
based  matching  technique  and  region-based  tech¬ 
niques  may  not  give  a  sufficiently  dense  height  map. 
Moreover,  registration  errors  may  be  on  the  order 
of  a  few  pixels,  leading  to  errors  in  pixel-by-pixel 
height computations. We therefore restrict our attention
to detecting heights of man-made structures
like buildings and natural objects such as treetops,
which are easier to detect and match.

Figure 3: Registration example: Reference image (top), alternate views (middle), and other views registered
to reference image (bottom).


Figure 4: Observed range location of a tall object

Layover causes the slant plane image of the top of a
structure of height $h$ (Figure 4) to appear at range
location

$$r_{obs}^{(i)} = r^{(i)} - h \sin\theta^{(i)}$$

The superscript $i$ refers to image $i$, $\theta$ is the depression
angle, and $r$ and $r_{obs}$ are the slant range to the
base and top of the structure, respectively. Mathematically,
it is possible to compute the height of a
tall object, given the exact location of a single point
on it in two views, and the transformation between
the two images. In practice, the difficulty in automatically
localizing point features in SAR images
and inaccurate registration lead to erroneous height
computation.

Linear features, which are easier to detect than point
features, arise in SAR terrain imagery due to the
cardinal streaks of buildings, treelines, road edges,
vegetation boundaries, etc. We are specifically interested
in extracting heights of buildings and trees.
Due to layover, the tops of the vertical sides of buildings
and treelines, closest to the radar, are at nearer
ranges than their bases. These features can be automatically
detected using segmentation/labeling or
can be manually selected from the image pairs.

Let $l^{(i)}$ be the location of the base of a linear structure
of height $h$ in image $i$. In a digital image, a line
$l$ can be thought of as a collection of points $p_j$:

$$l^{(i)} = \{p_j^{(i)}\} = \{[x_j^{(i)}\ r_j^{(i)}\ 1]^T\}$$

Since layover affects the range location of the top of
a vertical structure, its apparent location is given by
the collection of points

$$p_j'^{(i)} = p_j^{(i)} - h \sin\theta^{(i)}\,[0\ 1\ 0]^T$$

Let $\tilde{p}_j^{(2)}$ be the affine projection (according
to (2)) of $p_j'^{(1)}$ in the second image:

$$\tilde{p}_j^{(2)} = A_3\,(p_j^{(1)} - h \sin\theta^{(1)}\,[0\ 1\ 0]^T) \qquad (3)$$

where $A_3$ is the ($3 \times 3$) affine transformation matrix
of (2). Given that the locations of points along the
base of the structure in the two images are related
by $p_j^{(2)} = A_3\,p_j^{(1)}$, (3) can be rewritten as

$$\tilde{p}_j^{(2)} = p_j^{(2)} - h \sin\theta^{(1)} A_3\,[0\ 1\ 0]^T \qquad (4)$$

$$\phantom{\tilde{p}_j^{(2)}} = p_j'^{(2)} + h\,(\sin\theta^{(2)} I_3 - \sin\theta^{(1)} A_3)\,[0\ 1\ 0]^T$$

where $I_3$ is the ($3 \times 3$) identity matrix. Substituting
for $A_3$, (4) becomes

$$\tilde{p}_j^{(2)} = p_j'^{(2)} + h\,[\,-a_{12}\sin\theta^{(1)} \quad \sin\theta^{(2)} - a_{22}\sin\theta^{(1)} \quad 0\,]^T$$

Although it is not possible to detect pairs of corresponding
points on the two lines, it is possible to
compute height, $h$, by computing the perpendicular
distance, $\rho$, between them:

$$h = \frac{\rho}{(\sin\theta^{(2)} - a_{22}\sin\theta^{(1)})\sin\alpha - a_{12}\sin\theta^{(1)}\cos\alpha}$$

where $\alpha$ is the angle made by a line perpendicular
to either of $\tilde{l}^{(2)}$ or $l'^{(2)}$ with the positive cross-range
axis in the second image.

In practice, it is difficult to ensure that the projected
line $\tilde{l}^{(2)}$ is exactly parallel to the observed line $l'^{(2)}$.
Since the above technique for height extraction relies
on the distance between two parallel line segments
and their slope, refinements have to be made to the
original line segment locations, to make them parallel.
In order to achieve this, the observed lines in
the two images are de-rotated by equal and opposite
amounts about their midpoints. This is justified,
since the line extraction technique is the same
in both images, and any errors are expected to affect
both images in the same statistical sense. It is
not difficult to show that the corresponding rotation
angle, $\varphi$, is the solution of

$$\frac{a_{21} + a_{22}\tilde{m}^{(1)}}{a_{11} + a_{12}\tilde{m}^{(1)}} = \frac{-\sin\varphi + m^{(2)}\cos\varphi}{\cos\varphi + m^{(2)}\sin\varphi}$$

where $m^{(i)}$ is the slope of the observed line in image
$i$, and

$$\tilde{m}^{(1)} = \frac{\sin\varphi + m^{(1)}\cos\varphi}{\cos\varphi - m^{(1)}\sin\varphi}$$
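The height computation from the perpendicular line offset can be sketched in a few lines. The example is illustrative only: it assumes the two top lines are already parallel, takes the offset as a signed distance measured along the unit normal (cos alpha, sin alpha) from the observed line to the projected line, and uses hypothetical function names. The affine entries a12 and a22 are those of the registration matrix.

```python
import math


def signed_offset(projected_point, observed_point, alpha):
    """Signed distance from the observed top line to the projected top line,
    measured along the unit normal (cos alpha, sin alpha). Because the lines
    are parallel, any point on each line gives the same value."""
    dx = projected_point[0] - observed_point[0]
    dr = projected_point[1] - observed_point[1]
    return dx * math.cos(alpha) + dr * math.sin(alpha)


def layover_height(rho, a12, a22, theta1, theta2, alpha):
    """Height from the line offset rho, the affine entries a12 and a22, the two
    depression angles, and the normal direction alpha in the second image."""
    denom = ((math.sin(theta2) - a22 * math.sin(theta1)) * math.sin(alpha)
             - a12 * math.sin(theta1) * math.cos(alpha))
    if abs(denom) < 1e-9:
        raise ValueError("degenerate geometry: the two views give no height parallax")
    return rho / denom
```

The denominator vanishes when the two views produce the same layover displacement along the line normal, in which case the pair of images carries no height information for that structure.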

4 False Alarm Reduction Using
Context

In  this  section,  we  demonstrate  the  use  of  con¬ 
text,  derived  from  single  or  multipass  imagery,  for 
false  alarm  reduction.  In  the  images  we  experi¬ 
mented  with,  stationary  vehicles,  cardinal  streaks 
from  buildings,  and  other  strong  reflectors,  like  di¬ 
hedrals  and  trihedrals,  produced  bright  returns.  If 
only  a  single  image  of  a  scene  is  available,  a  site 
model  can  provide  context  for  ATR  algorithms,  by 
identifying  regions  where  targets  may  be  expected  to 
be  present.  In  the  case  of  data  collected  from  mul¬ 
tiple  passes  over  the  same  site,  the  images  can  be 
registered  and  used  for  providing  consistency-based 
false  alarm  reduction. 

4.1  Context  from  a  single  image 

In  Section  2,  we  considered  schemes  for  delineating 
natural  regions,  such  as  forests  and  lakes,  as  well 
as  identifying  man-made  objects,  such  as  buildings 
and  roads.  Since  military  ground  vehicles  are  not 
expected  in  dense  forest  or  water,  bright  pixel  clus¬ 
ters  surrounded  by  these  labels  can  also  be  removed 
as  false  alarms. 

Figure  5  shows  the  SAR  image  of  an  urban  area. 
CFAR  processing  of  the  original  image  produced 
more  than  1500  candidate  target  clusters  in  this 
complex scene. Inspection showed fewer than 10% of
these clusters to be cars. The remainder are false
alarms  due  to  buildings,  railroads,  bridge  railings, 
etc.  We  use  the  site  model  from  the  backscatter 
image  to  create  a  mask  for  buildings.  False  alarm 
mitigation  begins  with  the  removal  of  streaks  from 
buildings  using  this  mask.  Discrimination  can  then 
be  performed  on  the  remaining  target  chips  using  a 
clutter  training  set. 

An  example  obtained  from  TESAR  imagery  is  shown 
in  the  top  image  of  Figure  6.  The  middle  image  is  the 
segmented  image  with  detection  results  overlaid.  In 
addition  to  several  false  alarms  in  foliage,  there  are 
two  partially  occluded  vehicles,  two  vehicles  in  the 
clear,  and  several  trihedrals  which  we  consider  to  be 
targets.  Based  on  the  segmentation  results,  we  can 
declare  detections  in  heavy  clutter  to  be  unreliable; 
these  are  marked  by  boxes  in  the  bottom  image.  Af¬ 
ter  extracting  the  context  by  finding  the  proportion 
of  the  different  labels  in  a  hollow  window  surround¬ 
ing  each  detection,  two  of  the  false  alarms  can  be 
eliminated  because  they  are  closely  surrounded  by 
trees  and  are  therefore  unlikely  to  be  ground  vehi¬ 
cles.  The  other  false  alarms  and  the  two  occluded 
targets  can  be  flagged  as  obstructed,  but  not  elimi¬ 
nated,  because  they  are  close  enough  to  the  treeline 
that  we  cannot  distinguish  isolated  trees  from  tar¬ 
gets. 
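
The hollow-window context measure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the label codes, the window half-widths, and the two decision thresholds are hypothetical values chosen only for the example.

```python
import numpy as np

TREE = 1  # hypothetical integer code for the "trees" label


def hollow_window_label_fractions(labels, row, col, inner, outer, n_labels):
    """Fraction of each segmentation label in a square ring around (row, col).

    `inner` and `outer` are half-widths in pixels. Pixels within `inner`
    of the detection (the detection itself) are excluded from the count.
    """
    r0, r1 = max(row - outer, 0), min(row + outer + 1, labels.shape[0])
    c0, c1 = max(col - outer, 0), min(col + outer + 1, labels.shape[1])
    window = labels[r0:r1, c0:c1]
    rr, cc = np.mgrid[r0:r1, c0:c1]
    ring = (np.abs(rr - row) > inner) | (np.abs(cc - col) > inner)
    counts = np.bincount(window[ring], minlength=n_labels)
    return counts / max(counts.sum(), 1)


def classify_detection(fractions, tree_label=TREE,
                       eliminate_at=0.9, flag_at=0.3):
    """Hypothetical rule: eliminate if closely surrounded by trees,
    flag as obstructed if near the treeline, otherwise keep."""
    if fractions[tree_label] >= eliminate_at:
        return "eliminate"
    if fractions[tree_label] >= flag_at:
        return "obstructed"
    return "keep"
```

In practice the thresholds would have to be tuned on imagery with known targets; the text does not state the values used.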


4.2  Context  from  multipass  imagery 

Registered  SAR  imagery  can  be  used  for  reducing 
the  number  of  candidate  targets.  Bright  pixels  de¬ 
tected  by  CFAR  processing  are  grouped  into  clusters 
and  small  clusters  are  eliminated.  False  alarms  due 
to  buildings  and  in  foliage  can  be  eliminated  using 
the  technique  described  earlier.  The  targets  may 
be  in  the  open,  partly  occluded  by  shadow  or  the 
layover  of  trees,  or  camouflaged.  Our  claim  is  that 
false  alarms  and  other  directional  reflectors  (e.g.  di¬ 
hedrals)  will  not  be  consistently  visible  in  multiple 
views,  whereas  complex  targets  can  be  observed  in 
all  views.  Therefore  we  look  for  consistent  bright 
pixel  clusters  in  multiple  views  to  reduce  the  false 
alarms.  We  take  care  of  occlusions  by  shadows  or 
trees  by  incorporating  them  in  our  consistency  check 
using  the  segmentation  map  of  the  scene.  We  do  not 
address  the  camouflage  issue  in  this  paper. 
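
A consistency check of this kind might look like the sketch below. It assumes the reference detections have already been transformed into the coordinate system of the second image; the label codes, the matching radius, and the choice of which labels count as occluding are assumptions made for illustration, not values taken from the experiments.

```python
import numpy as np

SHADOW, TREE, CLEAR = 0, 1, 2  # hypothetical label codes


def check_consistency(ref_points, test_points, test_labels,
                      max_dist=3.0, occluding=(SHADOW,)):
    """Classify each registered reference detection against a second pass.

    ref_points:  (N, 2) (row, col) positions in test-image coordinates
    test_points: (M, 2) (row, col) detections found in the test image
    test_labels: 2-D integer segmentation map of the test image
    Returns one of 'confirmed', 'unverified', 'removed', 'no_data' per point.
    """
    ref_points = np.asarray(ref_points, dtype=float).reshape(-1, 2)
    test_points = np.asarray(test_points, dtype=float).reshape(-1, 2)
    results = []
    for r, c in ref_points:
        ri, ci = int(round(r)), int(round(c))
        inside = (0 <= ri < test_labels.shape[0]
                  and 0 <= ci < test_labels.shape[1])
        if not inside:
            results.append("no_data")        # no overlap after registration
            continue
        if len(test_points):
            dist = np.hypot(test_points[:, 0] - r, test_points[:, 1] - c)
            if dist.min() <= max_dist:
                results.append("confirmed")  # seen in both views
                continue
        if test_labels[ri, ci] in occluding:
            results.append("unverified")     # occluded, cannot be rejected
        else:
            results.append("removed")        # visible region, no detection
    return results
```

Passing `occluding=(SHADOW, TREE)` would also retain detections that fall under tree layover in the second view.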

A  second,  registered  view  of  Figure  6  is  shown  in 
Figure  7  (top  image)  and  the  registered  segmenta¬ 
tion  and  CFAR  results  are  shown  in  the  middle  im¬ 
age.  Using  the  new  CFAR  detections,  we  can  verify 
that  the  two  false  alarms  eliminated  using  only  the 
first  pass  were  truly  false  alarms  and  four  of  the  other 
five  false  alarms  can  also  be  eliminated.  With  the 
context  information  provided  by  segmentation,  these 
false  alarms  are  eliminated  because  corresponding 
detections  in  the  second  pass  do  not  exist  and  the 
detection  points  are  labeled  as  trees  or  clear.  The  re¬ 
maining  false  alarm  cannot  be  removed  because  the 
detection  point  in  the  second  image  is  in  shadow. 
We  can  also  verify  that  the  two  partially  occluded 
targets  are  indeed  targets  and  not  isolated  trees  be¬ 
cause  they  are  detected  in  clear  areas  of  the  second- 
pass  image.  The  bottom  image  of  Figure  7  shows 
the result with all but one false alarm eliminated.

Figure  8  shows  examples  of  multipass  Stockbridge 
imagery,  which  demonstrate  false  alarm  reduction 
techniques  using  multiple  views.  The  top  im¬ 
age  shows  a  reference  view  with  CFAR-detected 
pixel  clusters  marked.  Real  and  calibration  targets 
(marked  by  boxes)  as  well  as  false  alarms  (marked 
by  ovals)  were  manually  identified.  The  middle  row 
shows  two  other  views  similarly  processed.  Notice 
that  some  calibration  targets  disappear  while  others 
appear.  Some  of  the  real  targets  are  either  partially 
or  completely  occluded  by  shadows  and  trees.  A  lot 
of  false  alarms  show  up  in  foliage  where  clumps  of 
the  canopy  are  surrounded  by  shadow  regions.  The 
reference  image  is  registered  with,  and  transformed 
to,  the  coordinate  system  of  each  of  the  two  new 
images,  using  the  method  described  in  Section  3. 
Consistent  target  clusters  are  marked  as  targets  in 
the bottom row, while others are removed. In the
case of areas with no overlap after registration, the
corresponding clusters are marked by black boxes.

Figure 6: False alarm reduction using context derived from a single TESAR image: original image, result of two-pass Weibull OS CFAR, and unreliable detections (marked with boxes) based on multiresolution segmentation.

Figure 7: False alarm reduction using registered TESAR imagery: second image, CFAR results on the new image, and confirmed detections after combining the evidence from both views.

Figure 8: False alarm reduction using context derived from multiple images. Top: Ground truth image (boxes=targets, ovals=false alarms). Middle: Alternate views with manually classified targets. Bottom: Automatically classified targets using context from ground truth image (black boxes=no data, diamonds=possible new targets).

The  most  interesting  result  shows  up  in  the  bottom- 
right  image.  Here,  some  pixels  are  marked  by  di¬ 
amonds.  These  were  pixel  clusters  in  the  reference 
image  which  overlapped  with  regions  of  shadow  or 
tree  in  the  test  image.  An  example  of  the  former  is 
the  diamond  in  the  top  center,  and  of  the  latter,  the 
diamond  in  the  bottom  right.  The  first  represents 
a  target  completely  surrounded  by  shadow  as  indi¬ 
cated  by  the  segmentation  of  the  test  image,  and  the 
second  is  a  target  surrounded  by  a  tree  region.  In 
reality,  these  were  targets  which  were,  respectively, 
in  shadow  and  laid  over  by  a  tree  in  the  test  image. 

5  Discrimination 

We  first  remove  clusters  of  CFAR-detected  pixels 
which  are  determined  to  be  false  alarms,  either  from 
the  context  derived  from  a  single  image  or  because  of 
lack  of  consistency  in  multi-pass  imagery.  Candidate 
target  chips  are  generated  by  grouping  the  surviv¬ 
ing  pixels  and  computing  their  bounding  boxes.  A 
set  of  features  is  computed  for  each  target  chip  and 
compared  to  a  feature  set  derived  from  the  clutter 
training  set. 

We use a combination of the features developed at
Lincoln Laboratory [Kreithen et al., 1993; Novak et
al., 1993] for single-polarization data, namely, standard
deviation, weighted rank fill ratio and fractal
dimension, and features related to the size of the
target chip, namely, area, length and width. These
features are summarized in Table 1. $P_{ij}$ is the pixel
at the $(i,j)$th image location, and P^p are those
pixels that belong to the target chip. $N$ is the
total number of pixels in the target chip, and $X$ and
$Y$ are the dimensions (in image coordinates) of the
upright bounding box of the region ($X > Y$).
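
As a rough illustration of the size and fractal-dimension features, the sketch below computes them from a binary chip mask. It is not the Lincoln Laboratory implementation: the box count $M_2$ here uses 2 x 2 boxes aligned to a fixed grid (a simplification of a minimal covering), and the function names are hypothetical.

```python
import numpy as np


def size_features(mask):
    """Area N and upright bounding-box dimensions (X >= Y) of a chip."""
    rows, cols = np.nonzero(mask)
    area = rows.size
    height = int(rows.max() - rows.min() + 1)
    width = int(cols.max() - cols.min() + 1)
    return area, max(height, width), min(height, width)


def chip_length(X, Y, orientation_deg=None):
    """Length as in Table 1: the diagonal for 45/135 degree orientations,
    otherwise the longer bounding-box side."""
    if orientation_deg in (45, 135):
        return float(np.hypot(X, Y))
    return float(max(X, Y))


def fractal_dimension(mask):
    """FD = log2(M1 / M2), with M1 the number of 1 x 1 boxes (pixels) and
    M2 the number of grid-aligned 2 x 2 boxes that contain chip pixels."""
    mask = np.asarray(mask, dtype=bool)
    m1 = int(mask.sum())
    h, w = mask.shape
    padded = np.zeros((h + h % 2, w + w % 2), dtype=bool)
    padded[:h, :w] = mask
    blocks = padded.reshape(padded.shape[0] // 2, 2, padded.shape[1] // 2, 2)
    m2 = int(blocks.any(axis=(1, 3)).sum())
    return float(np.log2(m1 / m2))
```

With this definition a filled region gives a value near 2, a thin line near 1, and isolated pixels near 0, which is why the feature helps separate spatially extended returns from scattered bright pixels.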

If  templates  of  targets  are  available  we  use  a  target 
training  set  and  compare  features  of  templates  to 
those  of  target  chips.  If  not,  since  the  set  of  possi¬ 
ble  types  of  targets  is  quite  large,  we  compare  the 
features  extracted  from  the  candidate  target  chips 
to  those  derived  from  clutter  only.  As  many  clutter 
training sets are built as there are types of clutter observed
in a standard training image. A Mahalanobis-like
distance measure is computed for each chip:

$$Z_i = \frac{1}{n}\,(X_i - M_{tr})^{T}\, S_{tr}^{-1}\, (X_i - M_{tr}) \qquad (5)$$

where  n  is  the  number  of  features  used.  This  dis¬ 
tance  is  compared  to  a  trained  threshold  to  discrim¬ 
inate  non-clutter  objects  from  clutter. 
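
The distance in (5) can be sketched in a few lines. The code assumes that $X_i$ is the feature vector of chip $i$ and that $M_{tr}$ and $S_{tr}$ are the mean vector and covariance matrix of the training-set features; the text does not define these symbols explicitly, so this reading, and the function names, are assumptions.

```python
import numpy as np


def fit_training_set(features):
    """Mean vector and covariance matrix of an (n_chips, n_features) array."""
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0), np.cov(features, rowvar=False)


def mahalanobis_like(x, mean, cov):
    """Z = (1/n) (x - mean)^T cov^{-1} (x - mean), n = number of features."""
    d = np.asarray(x, dtype=float) - mean
    return float(d @ np.linalg.solve(cov, d)) / d.size


def is_non_clutter(x, mean, cov, threshold):
    """A chip far from the clutter training set is kept as non-clutter."""
    return mahalanobis_like(x, mean, cov) > threshold
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse of the covariance matrix, which is numerically safer when features are strongly correlated.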

Examples  of  discrimination  applied  to  two  images 
from  Lincoln  Lab’s  Stockbridge  Target  Array,  con¬ 
sisting  of  7  to  8  targets,  are  shown  in  Figure  9. 
CFAR processing produced approximately 20 candidate
target chips. We used Xpatch-generated templates
of two targets (M48 and M113) to compute
the  target  features.  All  features  of  candidate  target 
chips  are  compared  to  those  of  the  training  set  using 
the  distance  measure  of  (5).  The  threshold  for  dis¬ 
criminating  targets  from  clutter  was  trained  using 
one  view  and  applied  to  the  others.  Figure  9  shows 
the  results  of  reduction  of  false  alarms.  Three  sets 
of  discrimination  features  were  tested — Lincoln  Lab¬ 
oratory  features  only  (standard  deviation,  weighted 
rank  fill  ratio,  fractal  dimension),  fractal  dimension 
and  size  features  (area,  length,  width),  and  a  com¬ 
bination  of  all  the  features.  Table  2  shows  the  re¬ 
sults  of  discrimination  applied  to  three  passes  of  the 
Stockbridge  Target  Array.  The  distance  threshold 
for  discrimination  was  set  to  retain  all  the  targets. 
Performance  can  be  judged  by  looking  at  the  num¬ 
ber  of  false  alarms  in  the  output  of  the  discrimi¬ 
nator.  We  found  that  Lincoln  Laboratory’s  single¬ 
polarization  features  alone  were  not  sufficient  for 
good discrimination. The fewest false alarms were
produced  by  the  size  and  shape  features,  which  seem 
to  work  best  at  this  resolution. 

6  Conclusion 

Use  of  context  can  greatly  reduce  the  burden  on 
ATR  algorithms  by  reducing  the  number  of  can¬ 
didate  targets.  We  define  context  to  be  a  region 
delineation  for  a  single  image  and  detection  consis¬ 
tency  across  registered  imagery  for  a  multipass  sce¬ 
nario.  For  the  single-pass  case,  context  is  derived 
using  intensity  data  and  a  multiresolution  segmen¬ 
tation  scheme.  We  have  also  formulated  the  reg¬ 
istration  equation  for  images  of  the  same  site  ac¬ 
quired  from  an  airborne  SAR  platform.  Examples  of 
context-derived  false  alarm  reduction  for  both  cases 
were  presented.  Finally,  a  set  of  shape-  and  texture- 
based  features  was  used  for  discriminating  targets 
from  clutter.  Experimental  results  using  TESAR 
and  ADTS  data  were  presented. 

7  Acknowledgments 

We  wish  to  thank  Don  Wineberg  and  Vince  Zagardo 
of  Northrop  Grumman  Corporation  for  obtaining 
permission  to  publish  the  TESAR  imagery.  We  also 
wish  to  thank  Les  Novak  of  MIT  Lincoln  Laboratory 
for  providing  the  ADTS  data. 


Features | Expression | Comments
Standard Deviation | [illegible in source] | $S_i = \sum [10 \log_{10} P_{ij}]^i$ for $i = 1, 2$
Weighted Rank Fill Ratio | [illegible in source; involves the $l$ brightest pixels] | $R$ = (100 x 100) square; $l$ = 5% of size-of($R$)
Fractal Dimension | $FD = \log_2(M_1 / M_2)$ | $M_i$ = No. of boxes of size $i$ reqd. to fill target cluster
Area | [illegible in source] |
Length | $L = \sqrt{X^2 + Y^2}$ | If orientation = 45°, 135°
       | $L = \max(X, Y)$ | If orientation = 0°, 90°, or none
Width | [illegible in source] |

Table 1: Feature set for discrimination


Features 

Pass  1  (18  detections) 

Pass  2  (16  detections) 

Pass  3  (25  detections) 

Targets 

False  Alarms 

Targets 

False  Alarms 

Lincoln  Laboratory 

16 

4 

13 

4 

15 

5 

Size  and  Shape 

1 

10 

1 

10 

0 

All  Together 

15 

3 

2 

14 

4 

Table  2:  Results  of  the  discrimination  for  the  Stockbridge  target  array  data 

Figure 9: Discrimination example; boxes are targets
and  ovals  are  false  alarms. 

Abstract

We  present  a  parametric  model  for  radar  scat¬ 
tering  as  a  function  of  frequency  and  aspect  an¬ 
gle.  The  model  is  used  for  analysis  of  synthetic 
aperture  radar  data.  The  estimated  parame¬ 
ters  provide  a  concise,  physically  relevant  de¬ 
scription  of  measured  scattering  for  use  in  tar¬ 
get  recognition,  data  compression  and  scatter¬ 
ing  studies.  The  scattering  model  and  an  image 
domain  estimation  algorithm  are  applied  to  two 
measured  data  examples. 

1  Introduction 

At  high  frequencies,  the  scattering  response 
of  an  object  is  well  approximated  as  a  sum 
of  responses  from  individual  scattering  centers 
[Keller,  1962].  These  scatterers  provide  a  phys¬ 
ically  relevant,  yet  concise,  description  of  the 
object  and  are  thus  good  candidates  for  use 
in  target  recognition,  radar  data  compression, 
and  scattering  phenomenology.  In  this  paper  we 
consider  the  analysis  of  radar  data  measured  as 
a  function  of  frequency  and  aspect  angle.  We 
develop  a  parametric  scattering  model  for  this 
two-dimensional  problem.  The  model  is  based 
on  both  the  physical  optics  and  the  geometric 
theory of diffraction (GTD) monostatic scattering
solutions and extends the one-dimensional
GTD model presented in [Potter, et al., 1995]
to  include  aspect  angle.  Our  model  provides  a 
physical  description  of  target  scattering  centers, 
each  of  which  is  described  by  a  set  of  parameters 
describing  position,  shape,  orientation  (pose) 
and  relative  amplitude.  This  is  a  richer  descrip¬ 
tion  of  target  scattering  than  is  available  either 
from conventional Fourier-based imaging techniques
[Mensa, 1991] or from less physically accurate
point scattering parametric models [Tu,
et al., 1997].

Recent  developments  in  mechanism  extraction 
from  two-dimensional  radar  data  [Tu,  et  al., 
1997],  [Sacchini,  et  al.,  1993]  are  based  on  the 
assumption  that  scattering  centers  are  localized 
to  isolated  points.  While  this  description  is 
valid  for  many  scattering  centers  at  many  as¬ 
pect  angles,  some  common  scattering  mecha¬ 
nisms  behave  as  distributed  elements,  and  point 
scattering  models  fail  to  accurately  model  the 
scattering.  The  aspect  dependence  in  our  two- 
dimensional  model  allows  description  of  both  lo¬ 
calized  and  distributed  scattering  centers,  pro¬ 
viding  a  higher  fidelity  description  of  scattered 
fields.  The  improved  model  provides  the  po¬ 
tential  both  for  improved  data  compression  and 
for  the  discrimination  of  localized  versus  dis¬ 
tributed  scattering  mechanisms. 

We develop a simple parametric model of far-field
scattering as a function of frequency and
aspect angle. The parameters are treated as unknown,
deterministic quantities. We present an
algorithm to estimate the parameters from an
image  domain  representation  of  the  measured 
data.  Estimation  in  the  image  domain,  rather 
than  in  the  frequency-aspect  domain,  provides 
the  advantages  of  clutter  suppression,  model  or¬ 
der  reduction,  and  computational  savings. 

The  paper  is  organized  as  follows.  In  Sec¬ 
tion  2  we  present  the  two-dimensional  scatter¬ 
ing  model  based  on  GTD  and  physical  optics 
solutions  for  simple  geometries.  In  Section  3 
we  transform  the  frequency-aspect  angle  do¬ 
main  model  into  the  image  domain  for  the  pur¬ 
pose  of  parameter  estimation.  In  Section  4  we 
present  an  algorithm  for  estimation  of  the  un¬ 
known  parameters  of  the  model  from  measured 
data.  In  Section  5  we  present  the  Cramer-Rao 
lower  bound  (CRB)  on  the  variance  of  the  esti¬ 
mated  model  parameters,  and  discuss  practical 
implications  of  the  CRB  for  parameter  uncer¬ 
tainty.  In  Section  6  we  present  experimental 
results  obtained  by  applying  our  estimation  al¬ 
gorithm  to  data  measured  in  a  compact-range 
anechoic  chamber. 

2  Model  Development 

We  develop  a  parametric  model  for  the 
backscatter  from  objects  measured  as  a  function 
of  frequency  and  aspect  angle.  We  seek  a  model 
that  maintains  high  fidelity  to  the  scattering 
physics  for  many  objects,  yet  is  sufficiently  sim¬ 
ple  in  its  functional  form  to  permit  robust  in¬ 
ference  from  estimated  parameters. 

For  this  development,  we  assume  a  data  collec¬ 
tion  scenario  consistent  with  synthetic  aperture 
radar  (SAR)  imaging.  A  reference  point  is  de¬ 
fined,  and  we  require  that  the  radar  trajectory 
and  reference  point  are  co-planar.  We  label  this 
imaging plane using an $x$-$y$ Cartesian coordinate
system with origin at the reference point.
The radar position is then described by an angle
$\phi$ defined counterclockwise from the $x$ direction.
We  assume  far  zone  backscatter,  and  therefore 
obtain  plane-wave  incidence  on  objects. 

From the geometric theory of diffraction (GTD)
[Keller,  1962]  and  its  uniform  version  [Kouy- 
oumjian  and  Pathak,  1974],  if  the  wavelength 
of  the  incident  excitation  is  small  relative  to  the 


target  extent,  then  the  backscattered  field  from 
an  object  consists  of  contributions  from  electri¬ 
cally  isolated  scattering  centers.  In  developing 
our  model,  we  proceed  in  a  similar  fashion  and 
characterize  the  frequency  and  aspect  angle  de¬ 
pendence  of  individual  scattering  centers.  Each 
scattering  center  is  described  by  a  small  number 
of  parameters.  The  total  scattered  field  from  a 
target  is  then  modeled  as  the  sum  of  these  in¬ 
dividual  scatterers. 

We  make  three  assumptions  about  the  far  zone 
backscattered  field,  and  each  assumption  leads 
to  the  functional  form  for  a  portion  of  our  scat¬ 
tering  model.  First,  phase  dependence  is  lin¬ 
ear  and  defined  by  the  position  of  the  scatter¬ 
ing  center.  Second,  amplitude  dependence  on 
frequency is defined by the high-frequency approximation
derived from the GTD. Third, amplitude
dependence on aspect angle is defined
by  characterizing  the  scattering  center  as  either 
spatially  localized  or  distributed.  We  consider 
these  three  dependencies,  each  in  turn,  to  arrive 
at  a  parametric  scattering  model. 

First,  we  consider  only  far-field  scattering  with 
a  linear  phase  dependence  with  frequency.  The 
phase  of  a  scattering  center,  at  a  given  aspect 
angle,  is  determined  by  the  down  range  position 
of the scatterer. Accordingly, the backscattered
field of the $n$th scattering center is expressed

$$E_n(k,\phi) = S_n(k,\phi)\,\exp(-j2k\,\hat{r}\cdot\vec{r}_n) \qquad (1)$$

where $k = 2\pi f/c$ is the wave number, $f$ is frequency
in Hertz, $c$ is the propagation velocity,
$\phi$ is the aspect angle, $\hat{r}$ is the unit vector in the
direction of the scattered field, and $\vec{r}_n = [x_n, y_n]$
is the position vector of the $n$th scattering center
projected to the $x$-$y$ plane. The time
convention is assumed and suppressed. Here we
consider  only  the  co-polarized  field;  as  such,  all 
field  quantities  are  written  as  scalars.  The  de¬ 
velopment  is  easily  extensible  to  multiple  polar¬ 
izations  [Chiang,  1996].  In  summary,  the  phase 
dependence  of  our  model  describes  the  location 
of  each  scattering  center  in  the  plane  of  the 
radar  measurement. 

Second,  we  consider  the  amplitude  dependence 
on frequency. In presenting the GTD, Keller
[Keller, 1962] uses a conservation of energy argument
to propose that the field diffracted from
a point on an edge is proportional to $(jk)^{-1/2}$,
and the field diffracted from a vertex is proportional
to $(jk)^{-1}$. The simplicity of the GTD is
that many practical object geometries give rise
to a sum of these two scattering mechanisms. In
[Plonus, et al., 1978] and [Ross, 1966], it is shown
that in addition to the edge and vertex diffraction,
a larger class of scattering geometries also
fit the $(jk)^{\alpha}$ power dependence on frequency,
where the parameter $\alpha$ has a half-integer value
(see Table 1).

Table 1: Alpha values for canonical scatterers.

$\alpha$ | Example scattering geometries
1 | flat plate at broadside; dihedral
1/2 | singly curved surface reflection
0 | point; sphere; straight edge specular
-1/2 | edge diffraction
-1 | corner diffraction

Third, we consider aspect dependence of scattering
amplitude. As aspect angle is varied,
we  assume  that  a  scattering  center  behaves  in 
one  of  two  ways:  either  a  scatterer  is  localized 
and  appears  to  exist  at  a  single  point  in  space, 
or  it  is  distributed  in  the  imaging  plane  and 
appears  as  a  finite,  nonzero-length  current  dis¬ 
tribution.  The  amplitude  dependence  on  aspect 
angle  is  different  for  each  of  these  scenarios,  and 
we  seek  a  model  that  accounts  for  both  scatter¬ 
ing  behaviors  in  a  physically  accurate,  yet  sim¬ 
ple,  functional  form. 

Examples  of  point  mechanisms  are  trihedral  re¬ 
flection,  corner  diffraction,  and  edge  diffraction. 
All  of  these  mechanisms  have  slowly  varying 
amplitude  as  a  function  of  aspect  angle.  We 
exploit  the  commonality  of  point  mechanisms 
by  modeling  this  slowly  varying  function  with  a 
damped exponential

$$S_n(f,\phi) = A_n \exp(-2\pi f \gamma_n \sin\phi) \qquad (2)$$

The exponential function provides a mathematically
convenient approximation containing only
a single parameter. Although physical insight
is used to arrive at the exponential model, the
parameter $\gamma_n$ has no direct physical interpretation.

On  the  other  hand,  examples  of  distributed 
scattering  mechanisms  are  flat  plate  reflec¬ 
tion,  dihedral  reflection,  and  cylinder  reflection. 
Each of these scattering mechanisms has an amplitude
dependence on aspect angle that contains
a $\mathrm{sinc}(x) = \sin(x)/x$ function. In all cases
this  sinc(x)  function  is  the  dominant  term  in 
the  physical  optics  far-zone  scattering  solution, 
and  we  adopt  the  sinc(x)  function  to  character¬ 
ize  angle  dependence  in  the  scattering  model  for 
scattering centers that are distributed:

$$S_n(f,\phi) = A_n\,\mathrm{sinc}\left(k L_n \sin(\phi - \phi_n)\right) \qquad (3)$$

where $L_n$ is the length and $\phi_n$ is the orientation
angle of the distributed scatterer.

We  combine  the  different  model  terms  from 
the  point  and  the  distributed  scattering  mech¬ 
anisms  to  write  our  2-D  scattering  model  in  a 
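single expression

$$\begin{aligned}
E_n(f,\phi) = {} & A_n\,(jk)^{\alpha_n} \\
& \cdot\, \mathrm{sinc}\left(\frac{2\pi f}{c} L_n \sin(\phi - \phi_n)\right) \\
& \cdot\, \exp(-2\pi f \gamma_n \sin\phi) \\
& \cdot\, \exp\left(-j\frac{4\pi f}{c}\,(x_n\cos\phi + y_n\sin\phi)\right)
\end{aligned} \qquad (4)$$

where $L_n = 0$ if the scattering center is localized,
and $\gamma_n = 0$ if the scatterer is distributed.
The parameter $A_n$ is a relative amplitude for
each scattering center. The total scattered field
is a sum of $p$ individual scattering terms:

$$E(f,\phi) = \sum_{n=1}^{p} E_n(f,\phi) \qquad (5)$$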

The scattering model in Eqn. 5 is a function
of frequency and aspect angle and is described
by the parameter set $(x_n, y_n, \alpha_n, \gamma_n, L_n, \phi_n)$ for
$n = 1, \ldots, p$. The parameters provide a rich
physical description of the scatterers that are
present in the data set. Each parameter, with
the exception of $\gamma_n$, has a direct physical interpretation.
Example scattering geometries distinguishable
by their $(\alpha, L)$ parameters are presented
in Table 2. The model is based on scattering
physics and is developed to describe a
large class of scatterers while still maintaining
a relatively simple form.

Table 2: Parameters $\alpha$ and $L$ serve to discriminate many scattering geometries.

Example scattering geometries | $\alpha$ | $L$
dihedral | 1 | $L \neq 0$
corner reflector | 1 | 0
cylinder | 1/2 | $L \neq 0$
sphere | 0 | 0
edge diffraction | 0 | $L \neq 0$
corner diffraction | -1 | 0
double corner diffraction | -2 | 0
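
To make the structure of the model concrete, the following sketch evaluates a single scattering term and the sum over scatterers. It is an illustration under stated assumptions, not the authors' code: the frequency factor is written as $(jf/f_c)^{\alpha}$ with a hypothetical reference frequency `f_c` (any constant scale can be absorbed into the amplitude), free-space propagation is assumed, and all function and argument names are invented for the example.

```python
import numpy as np

C = 299_792_458.0  # propagation velocity in m/s (free space assumed)


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the value 1 at x = 0."""
    return np.sinc(np.asarray(x) / np.pi)


def scatterer_field(f, phi, A, alpha, x, y,
                    L=0.0, phi_bar=0.0, gamma=0.0, f_c=10e9):
    """Field of one scattering center at frequency f (Hz), aspect phi (rad).

    L = 0 gives a localized scatterer (damped exponential in aspect);
    gamma = 0 gives a distributed scatterer (sinc in aspect).
    """
    f, phi = np.broadcast_arrays(np.asarray(f, dtype=float),
                                 np.asarray(phi, dtype=float))
    freq_term = (1j * f / f_c) ** alpha
    aspect_term = (sinc(2 * np.pi * f / C * L * np.sin(phi - phi_bar))
                   * np.exp(-2 * np.pi * f * gamma * np.sin(phi)))
    phase_term = np.exp(-1j * 4 * np.pi * f / C
                        * (x * np.cos(phi) + y * np.sin(phi)))
    return A * freq_term * aspect_term * phase_term


def total_field(f, phi, scatterers):
    """Sum of the individual scattering terms; each entry is a dict of
    keyword arguments for scatterer_field."""
    return sum(scatterer_field(f, phi, **s) for s in scatterers)
```

For example, a 1 m distributed scatterer viewed at 10 GHz has its first aspect null where $\sin(\phi - \phi_n) = c/(2 f L)$, roughly 0.86 degrees from broadside, which shows how sharply distributed mechanisms vary with aspect compared with localized ones.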

3  Transformation  of  Model  into 
Image  Domain 

Measured  radar  data  collected  as  a  function  of 
frequency and aspect is commonly processed coherently
to form an image for display and interpretation.
The magnitude or envelope of the
complex-valued image provides an intuitive picture
of the scattering behavior of the target.

The  image  domain  provides  several  advantages 
for  estimation  of  the  unknown  parameters  in 
Eqn.  4.  The  four  main  advantages  are:  (1)  clut¬ 
ter  suppression,  (2)  reduction  in  local  model  or¬ 
der,  (3)  reduction  in  computation  cost,  and  (4) 
insertion  into  multi-stage  SAR  target  detection 
processing.  These  advantages  are  a  result  of 
the  fact  that  radar  imagery  provides  a  tempo¬ 
ral  decomposition  of  the  measured  data.  In  the 
following  paragraphs,  we  consider  each  of  these 
advantages. 

First,  in  the  image  domain  we  apply  the  model 
to  high  energy  regions  in  order  to  accomplish 
clutter  suppression.  There  exist  many  image 
segmentation  algorithms  [Stach  and  LeBaron, 
1996]  to  automatically  parse  a  radar  image  into 
high  energy  regions.  The  highest  energy  regions 
are  assumed  to  contain  target  scattering  centers 
of  interest,  while  the  lower  energy  regions  are 
assumed  to  contain  predominately  background 
clutter  scattering.  Since  the  scattering  model  in 
Eqn.  4  does  not  effectively  model  clutter  behav¬ 
ior  with  low  model  order,  clutter  energy  must 
be suppressed in order to ensure low-variance
parameter  estimates. 

Second,  each  segmented  high-energy  image  re¬ 
gion  is  assumed  to  be  electrically  isolated  from 
other  peak  regions.  We  therefore  process  each 
region  in  parallel  with  a  local  model  order  that 
describes  the  number  of  scattering  centers  in 
the region only. A reduction in estimation complexity
is therefore achieved by the divide-and-conquer
approach to model order; the model order for an
entire image (or image chip) can be quite large,
while  an  individual  peak  region  may  have  local 
model  order  of  five  or  less. 

Third,  image  domain  processing  of  each  peak  re¬ 
gion  reduces  computation  complexity  by  reduc¬ 
ing  the  number  of  pixels  considered  in  the  least- 
squares  fit  of  the  parametric  model  to  the  mea¬ 
sured  data.  Fourth,  image  domain  processing 
allows  insertion  of  the  model-based  scattering 
analysis  into  a  multi-staged  automatic  target 
recognition algorithm. The model-based scattering
analysis is performed only after a computationally
inexpensive prescreening stage [Novak,
et al., 1993], and the image domain representation
conveniently combines both motion
compensation  data  and  antenna  measurements. 

The  model  in  Eqn.  4  describes  scattering  in 
the  frequency-aspect  domain.  In  order  to  ac¬ 
complish  image  domain  parameter  estimation, 
we  transform  the  scattering  model  from  the 
frequency-aspect domain into the image domain.
We process the parametric model using
the same series of operations through which
the motion-compensated frequency-aspect angle
measurements would pass during image formation.
There are many methods for image
formation [Mensa, 1991], but we limit the discussion
in this work to the two-dimensional Inverse
Fourier Transform (IFT) of the measured
frequency-aspect data. This imaging algorithm
is  widely  used  in  SAR  systems  for  which  the 
center  frequency  of  the  radar  is  large  compared 
to  the  bandwidth  of  the  radar.  In  order  to 
transform  the  frequency-aspect  model  into  the 
image  domain,  we  analytically  perform  a  two- 
dimensional  Inverse  Fourier  Transform  on  the 
scattering  model  of  Eqn.  4. 
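
The effect of the two-dimensional inverse transform on the phase term of the model can be checked numerically. The sketch below is an illustration only: it assumes the data are already sampled on a rectangular $(f_x, f_y)$ grid, uses an ideal localized scatterer with no window, and the sample spacing, start frequency, and grid size are hypothetical values.

```python
import numpy as np

C = 299_792_458.0         # propagation velocity in m/s
N = 64                    # samples per frequency axis (hypothetical)
df = 5e6                  # frequency sample spacing in Hz (hypothetical)
f0 = 10e9                 # first f_x sample in Hz (hypothetical)

extent = C / (2 * df)     # unambiguous image extent in metres
pixel = extent / N        # image sample spacing in metres

# Place the scatterer exactly on an image sample so the peak is unambiguous.
x_true, y_true = 10 * pixel, 20 * pixel

fx = f0 + df * np.arange(N)
fy = df * (np.arange(N) - N // 2)
FX, FY = np.meshgrid(fx, fy, indexing="ij")

# Linear-phase term of the model on the Cartesian frequency grid.
data = np.exp(-1j * 4 * np.pi / C * (x_true * FX + y_true * FY))

# The inverse DFT stands in for the continuous inverse Fourier transform.
image = np.fft.ifft2(data)
ix, iy = np.unravel_index(np.argmax(np.abs(image)), image.shape)
x_est, y_est = ix * pixel, iy * pixel
```

The peak of `image` falls at the sample corresponding to `(x_true, y_true)`, which is the sense in which the phase dependence of the model encodes scatterer location. A scatterer lying between samples, or a windowed data set, would spread the response over neighbouring pixels.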


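We begin with the model in Eqn. 4.

$$\begin{aligned}
E_n(f,\phi) = {} & A_n\,(jk)^{\alpha_n} \\
& \cdot\, \mathrm{sinc}\left(\frac{2\pi f}{c} L_n \sin(\phi - \phi_n)\right) \\
& \cdot\, \exp(-2\pi f \gamma_n \sin\phi) \\
& \cdot\, \exp\left(-j\frac{4\pi f}{c}\,(x_n\cos\phi + y_n\sin\phi)\right)
\end{aligned}$$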

First,  we  replace  the  power  dependence  of  am¬ 
plitude  on  frequency  with  an  exponential,  as  in 
[Chiang,  1996]: 

~  (-27r7-„/)  (6) 

where $r_n$ is a damping factor. We let the constant
term be absorbed into the complex amplitude,
$A_n$. We adopt the following affine map from $r_n$
to $\alpha_n$, as proposed in [Chiang, 1996].

^  (-27rA/V„)  -  1}  (7) 

The  expression  in  Eqn.  7  is  extremely  accurate 
for  small  relative  bandwidths  [Chiang,  1996]. 
For  example,  at  ten  percent  relative  bandwidth 
the  approximation  in  Eqn.  6  has  less  than 
0.0001% relative error. As the bandwidth increases
this error increases. Using this approximation,
we first estimate $r_n$ and then map it to
$\alpha_n$.

Second,  we  translate  the  model  from  polar  co¬ 
ordinates  to  Cartesian  coordinates  via  the  sub¬ 
stitution 

/x-  =  /  cos  (f) 

fy  =  /sin(/i  (8) 

By  making  this  coordinate  transformation  to 
the  Cartesian  frequency  plane,  we  assume  that 
the  measured  data  is  sufficiently  narrow  in 
bandwidth  so  as  to  allow  simple,  approximate 
interpolation  [Munson,  et  ah,  1983]  to  a  rectan¬ 
gular  grid.  We  further  approximate 


Third,  frequency  and  angle  domain  window 
functions  are  used  in  SAR  imaging  for  sidelobe 
suppression.  We  assume  that  the  window  func¬ 
tions  are  separable  in  their  Cartesian  compo¬ 
nents  and  can  be  written  as 

WihJy)  =  WAh)Wy{fy) 

P 

WAfx)  =  ^Bp"exp(j27r/3p"/,) 

p=i 

Q 

Wyify)  =  5:R^xp(j27r/l,V,)  (10) 

'7=1 

We  note  that  many  commonly  used  window 
functions  such  as  Rectangular,  Hamming,  and 
Taylor  windows  can  be  exactly  written  as  in 
Eqn.  10.  Inclusion  of  window  parameters  in 
the  model  adds  the  versatility  needed  for  cases 
where  only  image  domain  data  is  available  and 
the  effect  of  the  window  is  present  in  the  im¬ 
age.  The  cost  of  including  the  window  function 
is  increased  model  complexity. 

Fourth, we transform $E_n(f_x, f_y)$ to the image domain with a
two-dimensional inverse Fourier Transform. Note that in practice
measured data exists at a finite number of discrete frequencies and
aspect angles. As a result, the IFT performed to generate radar imagery
is typically an Inverse Discrete Fourier Transform (IDFT). Here we
analytically perform a continuous IFT for simplicity. In fact, the
alternative image domain model using the IDFT is not available in
closed form. The IDFT is approximately equal to the continuous IFT when
the image domain signal is essentially support-limited. Since most
radar imagery contains a small number of high energy regions that are
limited in extent, the sampling-induced aliasing is negligible. Thus,
we assume that the sampled IDFT is well-approximated by a continuous
IFT for radar imagery.

The further approximation made in the second step is

$$2\pi f r_n \approx 2\pi f_x r_n \qquad (9)$$

in the frequency-dependent exponential of Eqn. 6; this approximation
is valid for small angle spans.

The image domain model $e_n(t_x, t_y)$ for a single scattering center
is then written as


•sine 


2'kL  cos  (pt 
c 


Uy  -  fxtan(ly) 
*  exp  -2Tr fy  ^7,1  +  j  -  Pq  ~ 'ty

*  exp  -2-Kfx  (rn  +  j  ( -  Pp  -tx 


dfxdfyj  (11) 

where * denotes convolution, $f_{x1}$, $f_{x2}$ are the first and last
$f_x$ frequencies, and $f_{y1}$, $f_{y2}$ are the first and last $f_y$
frequencies.

As discussed in Section 2, either $L_n = 0$ or $\gamma_n = 0$, and we
consider each case separately. If $L_n = 0$, evaluation of the
integrals in Eqn. 11 yields

P=l  q=l 

■  exp  ^-27r/xc  (^I'n  +  j  ( — -  -  Pv  ~ 


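where $I_2(K)$ is defined in the Appendix and

$$K_1 = 2\pi\left(f_{y2} - f_{x1}\tan\phi_t\right)$$

$$K_2 = 2\pi\left(f_{y1} - f_{x1}\tan\phi_t\right)$$

$$K_3 = 2\pi\left(f_{y1} - f_{x2}\tan\phi_t\right)$$

$$K_4 = 2\pi\left(f_{y2} - f_{x2}\tan\phi_t\right)$$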


•  exp  (  -2'Kfyc  ( 7n  +  3  -  fdq  -i-. 


■sinhc  y’n  +  j 

•sinhc  (-n-Fy  {^n  +  3 


where 


=  '<'n  +  3 


-  tan  ft  (^-y 


$$F_x = f_{x2} - f_{x1}$$

$$F_y = f_{y2} - f_{y1}$$

$$f_{xc} = \frac{f_{x2} + f_{x1}}{2}$$

$$f_{yc} = \frac{f_{y2} + f_{y1}}{2}$$

$$\mathrm{sinhc}(x) = \frac{\sinh(x)}{x}$$

The model in Eqn. 12 is for L = 0, which corresponds to a localized
scattering mechanism. In the image domain, the localized mechanism is
represented by two separable functions in $t_x$ and $t_y$, each of
which appears as a sinhc(x) function.
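The sinhc shape can be checked numerically. Integrating a damped complex exponential over a finite band [f1, f2] gives F exp(-2 pi f_c z) sinhc(pi F z), where F = f2 - f1 and f_c is the band center. The sketch below compares this closed form with a midpoint-rule integral. The complex argument z (a damping term plus j times a delay offset), the band edges, and all function names are assumptions chosen for illustration; the sign conventions of Eqn. 11 are not reproduced here.

```python
import cmath
import math

def sinhc(z):
    """sinh(z)/z, with the removable singularity at z = 0 filled in."""
    return cmath.sinh(z) / z if z != 0 else 1.0

def band_integral_closed_form(z, f1, f2):
    """Integral of exp(-2*pi*f*z) df over [f1, f2], in closed form."""
    F, f_c = f2 - f1, 0.5 * (f1 + f2)
    return F * cmath.exp(-2.0 * math.pi * f_c * z) * sinhc(math.pi * F * z)

def band_integral_numeric(z, f1, f2, n=20000):
    """Same integral by the midpoint rule with n sub-intervals."""
    df = (f2 - f1) / n
    total = sum(cmath.exp(-2.0 * math.pi * (f1 + (k + 0.5) * df) * z)
                for k in range(n))
    return total * df

if __name__ == "__main__":
    z = 0.01 + 0.3j                    # assumed damping + j*(delay offset)
    print(band_integral_closed_form(z, 9.75, 10.25))
    print(band_integral_numeric(z, 9.75, 10.25))
```

For a purely imaginary argument, sinhc(jx) reduces to sin(x)/x, so an undamped point scatterer yields the familiar sinc response.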

On the other hand, if $\gamma_n = 0$, then evaluation of the integrals
in Eqn. 11 yields

en{tx,ty)  =  Y.  I]  jSn'^L  cos  ft 

p=lg=l 

•  exp  {-2nfxci^) 

*  {exp  {TfFxv)  [h  {Ki)  -  h2  (ATz)] 
-I-  exp  {-ttFxIa)  [I2  (ATs)  -  ^2  {I^a)]} 


As noted above, several approximations are made in arriving at
Eqns. 12 and 13. We verify these approximations by comparing the
image domain model to an image directly formed
by  applying  the  IDFT  to  a  128  x  128  array 
of  polar-format  samples  given  in  Eqn.  4.  The 
parameters  chosen  for  this  example  verification 
are  L  =  10m,  a  =  I,  x  =  Ihn,  y  =  I2ni, 
f  =  1.0°  and  A  =  1.  The  relative  error  between 
the  image  domain  model  and  the  transformed 
frequency-aspect  domain  data  for  this  example 
is  less  than  0.1%.  This  implies  that  the  several 
approximations  made  in  obtaining  the  paramet¬ 
ric  model  in  the  image  domain  do  not  contribute 
significant  error. 

4  Curve  Fitting 

In this section we present an approximate Maximum Likelihood (ML)
technique for estimating the parameters of the image domain scattering
model. For each of p scattering centers, there are eight real-valued
parameters to be estimated: the amplitude and phase of $A_n$, frequency
damping $r_n$, aspect damping $\gamma_n$, length $L_n$, tilt angle
$\phi_n$, down range position $x_n$, and cross range position $y_n$.
For the case where $L \neq 0$, we require $\gamma_n = 0$, whereas
$L = 0$ implies $\phi_n$ is not estimated.

Application of the image domain model requires several radar sensor
and image processing parameters. These image formation parameters are
center frequency, bandwidth, total angle span (aperture), numbers of
(interpolated) $f_x$ and $f_y$ frequency and angle samples, size of any
zero-padding used in the IFFT, and the window functions used in down
range and cross range. We assume that the image is generated using the
2-D IFFT of the measured frequency-aspect data. Further, we assume that
the bandwidth of the radar data is sufficiently narrow to allow for the
approximation of the polar frequency-aspect data as lying on a
rectangular $f_x$, $f_y$ grid. If this is not the case, it is assumed
that a 2-D interpolation to the Cartesian frequency plane is done prior
to the IFFT.

The  initial  step  in  our  algorithm  is  to  segment 
the  image  into  small  image  chips,  each  of  which 
contains  a  small  number  of  scattering  centers. 
Using  the  image  domain  model,  a  curve  fit  is 
then  computed  for  each  image  chip.  There  ex¬ 
ist  automatic  segmentation  algorithms  [Stach 
and  LeBaron,  1996];  alternatively,  the  image 
can  be  segmented  visually  with  human  inter¬ 
action.  Whichever  segmentation  procedure  is 
chosen,  the  result  is  a  partitioning  of  the  image 
into  a  set  of  smaller  image  chips,  each  of  which 
contains very few scattering centers. The segmentation highlights an
advantage of estimating parameters in the image domain: we partition
the large problem of estimating a single parametric model of large
order to explain the entire data set into smaller, more tractable
problems that can be solved in parallel.

For  each  segmented  image  chip  we  estimate  the 
parameters  of  the  scattering  by  computing  a 
least-squares  fit  of  the  image  domain  model.  For 
independent,  identically  distributed,  zero  mean 
additive  Gaussian  noise,  the  ML  estimate  of  the 
parameters  is  found  by  minimizing  the  squared 
error  between  the  model  and  the  measured  im¬ 
age  domain  data 

$$J(\Theta) = \sum_{\text{pixels}} \left|\,\text{image chip} - \text{model}(\Theta)\,\right|^2 \qquad (14)$$

where $\Theta$ is a vector containing the parameters to be estimated.
An iterative optimization procedure is used to minimize Eqn. 14.
There are
many  nonconvex  optimization  procedures  in  the 
literature,  and  we  choose  to  use  the  simplex 
downhill  method.  The  simplex  method  is  desir¬ 
able  because  it  is  numerically  stable  and  does 
not  require  a  gradient  or  Hessian  of  the  cost 
function. 

The  least-squares  cost  function  in  Eqn.  14  is 
nonconvex  with  many  local  minima.  Therefore, 
parameter initialization and model order selection
[Sabharwal, et al., 1996] are very important. Presently, model order
selection and the detection of $L \neq 0$ are performed interactively
by a human user. Likewise, $L_n$ and $\phi_n$ are initialized by the
user. Initialization of range and cross range positions is computed
from local maxima in the image chip, while $r_n$ and $\gamma_n$ are
initialized at zero (point scattering). For a fixed parameter set
$\Theta$, the least-squares cost function J is quadratic in the complex
amplitude parameter A; therefore, the least-squares estimate of A is
computed noniteratively using a matrix pseudo-inverse.
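The structure of this fit can be sketched on a toy problem. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it uses an undamped, separable sinc point response in place of Eqn. 12, a noise-free synthetic chip, a simplified downhill simplex routine written for this example, and hypothetical names throughout. It shows the three ingredients described above: initialization from the peak pixel, the closed-form amplitude for fixed positions, and simplex minimization of Eqn. 14 over the remaining parameters.

```python
import math

def sinc(x):
    return 1.0 if x == 0.0 else math.sin(x) / x

def point_response(tx, ty, x0, y0, Fx=1.0, Fy=1.0):
    # Toy unit-amplitude response: separable sinc, zero damping (assumed).
    return sinc(math.pi * Fx * (tx - x0)) * sinc(math.pi * Fy * (ty - y0))

def amplitude_and_cost(theta, chip, grid):
    # For fixed positions J is quadratic in A, so A has a closed form.
    m = [point_response(tx, ty, theta[0], theta[1]) for tx, ty in grid]
    A = sum(v * d for v, d in zip(m, chip)) / sum(v * v for v in m)
    J = sum(abs(d - A * v) ** 2 for v, d in zip(m, chip))
    return A, J

def nelder_mead(cost, start, step=0.25, tol=1e-8, max_iter=500):
    # Simplified downhill simplex: reflect, expand, contract, shrink.
    n = len(start)
    simplex = [list(start)] + [
        [s + (step if i == k else 0.0) for i, s in enumerate(start)]
        for k in range(n)]
    for _ in range(max_iter):
        simplex.sort(key=cost)
        best, worst = simplex[0], simplex[-1]
        if max(abs(w - b) for v in simplex[1:] for w, b in zip(v, best)) < tol:
            break
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]

        def along(t):
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        if cost(reflected) < cost(best):
            expanded = along(2.0)
            simplex[-1] = expanded if cost(expanded) < cost(reflected) else reflected
        elif cost(reflected) < cost(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = along(-0.5)
            if cost(contracted) < cost(worst):
                simplex[-1] = contracted
            else:  # shrink every vertex toward the best one
                simplex = [best] + [[b + 0.5 * (x - b) for b, x in zip(best, v)]
                                    for v in simplex[1:]]
    return min(simplex, key=cost)

def fit_demo():
    axis = [0.5 * k for k in range(-8, 8)]            # 16 x 16 pixel chip
    grid = [(tx, ty) for tx in axis for ty in axis]
    true_A, true_pos = 2.0 - 1.0j, (0.13, -0.07)      # synthetic ground truth
    chip = [true_A * point_response(tx, ty, *true_pos) for tx, ty in grid]
    peak = max(range(len(chip)), key=lambda i: abs(chip[i]))
    est = nelder_mead(lambda th: amplitude_and_cost(th, chip, grid)[1], grid[peak])
    return est, amplitude_and_cost(est, chip, grid)[0]

if __name__ == "__main__":
    print(fit_demo())
```

Because the amplitude is eliminated in closed form, the simplex searches only over position in this toy case, which mirrors the reduction in search dimension described above.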

At  convergence,  the  simplex  downhill  optimiza¬ 
tion  yields  estimates  of  scattering  parameters 
that  describe  the  position,  size,  shape  and  ori¬ 
entation  of  the  scattering  centers  that  comprise 
the  measured  target.  Automation  of  model  or¬ 
der  selection  and  parameter  initialization  is  a 
topic  of  continuing  development,  both  for  our 
proposed  scattering  model  and  for  simpler  point 
scattering  models  [DeGraaf,  1997]. 

5  Statistical  Analysis 

In  this  section  we  investigate  the  statistical 
properties  of  the  parametric  scattering  model 
presented  in  Eqn.  4.  We  use  the  Cramer-Rao 
bound  (CRB)  to  provide  a  lower  bound  on  the 
error  variance  of  the  estimated  model  parame¬ 
ters.  These  bounds  are  algorithm  independent 
and  provide  an  analysis  of  the  uncertainty  in  the 
estimated  parameters  that  describe  the  mea¬ 
sured  scattered  field. 

The  utility  of  the  variance  bounds  is  twofold. 
First,  we  use  the  bounds  to  predict  performance 
of  an  unbiased,  statistically  efficient  estimator. 
Accordingly,  the  uncertainty  in  the  estimated 
parameters  can  be  characterized  as  a  function 
of  system  parameters  such  as  bandwidth,  center 
frequency,  SNR,  frequency  sample  spacing,  and 
scattering  object.  Second,  we  use  the  bounds 
to  evaluate  suboptimal,  but  computationally 
attractive,  estimation  algorithms.  Specifically, 
the  bounds  provide  a  baseline  for  evaluating  the 
trade-off  between  estimation  performance  and 
computation. 


To  derive  bounds  on  estimation  accuracy,  we 
adopt  the  scattering  model  of  Eqn.  4  with  an 
additive  perturbation 

$$E(k, \phi) = \sum_{n=1}^{p} E_n(k, \phi) + \eta(k, \phi) \qquad (15)$$

Here, $\eta(k, \phi)$ represents the modeling error (background
clutter, sensor noise, model mismatch, incomplete motion compensation,
antenna calibration errors, etc.) and is assumed to be a white,
Gaussian noise process. Recall $L_n$ and $\phi_n$ are the length and
orientation angle of a distributed scatterer, respectively, while $x_n$
and $y_n$ represent the down range and cross range position of the
$n$th scattering mechanism. For this analysis, the model parameters are
considered deterministic. Further, the performance predictions assume a
parameter estimator that is unbiased, efficient and normally
distributed (as is asymptotically true for the least-squares
estimator).

The Cramer-Rao bounds are derived by a straightforward application of
the Cauchy-Schwarz inequality. We formulate the likelihood function of
the scattering model under the
assumption  of  additive,  independent,  identically 
distributed,  complex-valued  Gaussian  noise;  we 
then  derive  expressions  for  the  partial  deriva¬ 
tives  of  the  logarithm  of  the  likelihood  function 
with  respect  to  the  parameters  of  the  model  and 
the  variance  of  the  noise.  We  use  these  deriva¬ 
tives  and  the  independence  of  noise  samples  to 
obtain  the  Fisher  information  matrix. 

The Fisher information matrix can be computed for any choice of
scattering parameters and noise variance; the CRB covariance matrix is
then found by inverting the Fisher information matrix. Finally, for any
parameter in the set of scattering model parameters, a diagonal entry
of the CRB covariance matrix gives a lower bound on the variance
achievable by any unbiased estimator. Off-diagonal entries of the CRB
matrix lower-bound the correlation between estimated parameters.
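The recipe of forming the Fisher information matrix and inverting it can be sketched on a much simpler signal than Eqn. 4. The example below assumes a single point scatterer observed at stepped frequencies, s_k = a exp(j psi) exp(-j 4 pi f_k x / c), with unknown amplitude a, phase psi and range x, in complex white Gaussian noise of variance sigma2. It uses the standard result that the Fisher information entries are (2/sigma2) Re of the sum over k of conj(ds_k/dtheta_i) ds_k/dtheta_j. The one-dimensional signal model, the function names, and the chosen numbers are assumptions for illustration only.

```python
import cmath
import math

C = 299792458.0  # speed of light, m/s

def fisher_information(freqs_hz, a, psi, x, sigma2):
    """3x3 Fisher information for theta = (a, psi, x) of the toy model."""
    fim = [[0.0] * 3 for _ in range(3)]
    for f in freqs_hz:
        kappa = 4.0 * math.pi * f / C
        s = a * cmath.exp(1j * (psi - kappa * x))
        grad = [s / a, 1j * s, -1j * kappa * s]   # ds/da, ds/dpsi, ds/dx
        for i in range(3):
            for j in range(3):
                fim[i][j] += (2.0 / sigma2) * (grad[i].conjugate() * grad[j]).real
    return fim

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def crb_position_demo():
    freqs = [9.75e9 + 500e6 * k / 63 for k in range(64)]  # 64 steps, 500 MHz
    a, sigma2 = 1.0, 1.0
    crb = invert(fisher_information(freqs, a, 0.3, 1.0, sigma2))
    # Closed form for this toy model: only the spread of kappa matters.
    kappas = [4.0 * math.pi * f / C for f in freqs]
    mean = sum(kappas) / len(kappas)
    closed = sigma2 / (2.0 * a * a * sum((k - mean) ** 2 for k in kappas))
    return crb[2][2], closed

if __name__ == "__main__":
    bound, closed = crb_position_demo()
    print(math.sqrt(bound), math.sqrt(closed))   # std-dev bounds in meters
```

In this toy case the bound on range depends on the spread of the sampled frequencies rather than on the center frequency, because the unknown phase absorbs the mean phase slope.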

In this presentation, we use the CRB sensitivity analysis to address
three performance issues relating SNR, bandwidth, and center frequency
to parameter uncertainty: (1) resolution and accuracy in locating
scattering mechanisms; (2) uncertainty in characterizing the frequency
dependence of scattering; (3) accuracy in estimating the length and
tilt of a distributed scatterer. These three issues are representative
of the estimation performance predictions accessible via the
statistical analysis.

For each of the examples presented below, we consider a bandwidth and
aperture consistent with 1 ft × 1 ft SAR image resolution.
Specifically, we assume 500 MHz bandwidth with ±1.4° aperture at
$f_c = 10$ GHz and ±0.4242° aperture at $f_c = 33$ GHz. Additionally,
examples are computed for 64 equally spaced samples in both frequency
and aspect. Signal-to-noise ratio (SNR) values are reported as the
ratio of signal to noise energy computed in the frequency-aspect domain
samples; interpretation of SNR in the image domain as a difference
between peak signal and clutter floor (i.e., after pulse compression)
requires a shift of 36 dB.
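These figures can be checked with the usual small-angle resolution formulas: down range resolution c/(2B) and cross range resolution lambda/(2 x total aperture angle). The sketch below applies them to the stated bandwidth and apertures. It also shows that coherent integration over 64 × 64 samples corresponds to about 36 dB, which is consistent with the quoted shift. The formulas are standard approximations and the function names are chosen for this example; the paper does not state how its figures were computed.

```python
import math

C = 299792458.0      # speed of light, m/s
ONE_FOOT_M = 0.3048

def range_resolution(bandwidth_hz):
    """Down range resolution c / (2B) in meters."""
    return C / (2.0 * bandwidth_hz)

def cross_range_resolution(fc_hz, half_aperture_deg):
    """Cross range resolution lambda / (2 * total angle), small-angle form."""
    wavelength = C / fc_hz
    total_angle = math.radians(2.0 * half_aperture_deg)
    return wavelength / (2.0 * total_angle)

def integration_gain_db(n_freq, n_angle):
    """Gain from coherently summing n_freq * n_angle samples, in dB."""
    return 10.0 * math.log10(n_freq * n_angle)

if __name__ == "__main__":
    print(range_resolution(500e6))               # close to one foot
    print(cross_range_resolution(10e9, 1.4))     # X-band case
    print(cross_range_resolution(33e9, 0.4242))  # K-band case
    print(integration_gain_db(64, 64))           # about 36 dB
```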

In Figure 1 we show the 95% confidence regions for the location
estimates of two point scatterers separated by $10\sqrt{2} = 14.14$ cm
($f_c = 10$ GHz). Adopting Hamming weighting for side lobe suppression,
the standard image resolution cell is 18.10 cm; however, resolution of
a model-based scattering analysis is limited only by sensor bandwidth,
signal-to-clutter ratio, and model fidelity. In the figure, the 95%
confidence regions are shown for -20 dB and 4 dB SNR; the circular
regions reflect the uncorrelated errors in range and cross range
estimates. A convenient definition of resolution is to require
nonoverlapping confidence regions; here, coherent processing of the
two-dimensional $(k, \phi)$ data resolves the two point scatterers for
any SNR exceeding -19 dB. In contrast, one-dimensional processing of a
single frequency scan requires ±19 dB SNR for 14.14 cm resolution
(i.e., $4\sigma$ separation).

In Figure 2 we show the probability of correct detection of the
frequency dependence parameter, $\alpha$, for a single scattering
mechanism. Figure 2(a) shows detection rate versus SNR for 1 ft
resolution X-band and K-band SAR systems ($f_c = 10$ GHz and
$f_c = 33$ GHz). The analytically derived detection results are
averaged over five scattering types
($\alpha \in \{-1, -\tfrac{1}{2}, 0, \tfrac{1}{2}, 1\}$).

Figure 1: Resolution, as defined by 95 percent confidence regions on
estimated scattering center locations. Fourier resolution is 18.1 cm.


Notably, uncertainty in estimating $\alpha$ decreases drastically with
an increase in relative bandwidth. This finding confirms the intuition
that accurate estimation of the trend in scattering amplitude versus
frequency requires either high bandwidth or low noise power. In
Figure 2(b) the detection of $\alpha$ is restricted to the binary
hypotheses of $\alpha = \tfrac{1}{2}$ or $\alpha = 1$; this represents,
for $L \neq 0$, the scenario of distinguishing a cylinder from a
dihedral.

In Figure 3 we show the standard deviation of parameter estimates
versus SNR for a dihedral near broadside ($L = 5$ m, $\alpha = 1$,
$\phi_t = 0.5^\circ$, $f_c = 10$ GHz, 500 MHz bandwidth). Observe that
the down range and cross range uncertainties in the location of the
dihedral are not equivalent. Appealing to intuition in the image
domain, the nonsymmetric resolution is explained in that a sharp peak
down range is more easily located than is the center of the broad cross
range response. As the tilt angle, $\phi_t$, increases, the difference
between the cross range and down range location uncertainties
decreases. This effect is easily understood by considering that the
dihedral response, measured off of normal incidence, is dominated by
diffraction from the two endpoints of the dihedral. Thus, off of normal
incidence, the dihedral response is well approximated by two individual
scattering mechanisms, each with symmetric down range and cross range
resolvability.

Figure 2: Predicted probability of correctly identifying alpha:
(a) five alpha values, (b) alpha 1/2 versus 1

Figure 3: Parameter uncertainty versus SNR for scattering model,
$L \neq 0$.

6  Examples 

We  present  two  measured  target  examples 
which  illustrate  the  effectiveness  of  our  image 
domain  model  at  compressing  large  measured 
data  sets  into  a  small  set  of  physically  descrip¬ 
tive  parameters.  These  parameters  describe  the 
shape,  position,  and  orientation  of  scattering 
centers  comprising  the  target  response  over  the 
measured  frequency  and  angle  spans. 

First  we  consider  the  scattering  from  a  square 
flat  plate  measured  in  the  Ohio  State  Univer¬ 
sity  ElectroScience  Laboratory  (ESL)  Compact 
Range  [Walton  and  Young,  1984].  We  analyze 
stepped  frequency  measurements  of  the  plate  for 
frequencies  9.5-10.5  GHz  in  20MHz  steps  and 
for  angles  ±3  degrees  (in  0.5  degree  steps)  from 
broadside  to  one  of  the  edges.  The  plate  is  a 
two  foot  square  and  lies  in  the  plane  of  rotation. 
The  measurement  polarization  is  horizontal. 

Figure 4 shows an image of the plate. The image contains three
scattering centers. The broadside response of the edge of the plate
appears as a line in the image. The two remaining corners on the back
of the plate appear as point mechanisms. These three mechanisms are
segmented in the image, and the algorithm of Section 4 is used to
estimate the parameters. Table 3 shows the estimated parameters and
their actual values.

Table 3: Estimated scattering parameters for plate example; Fourier
resolution, with Hamming windowing, is 19.2 cm.

Scatterer           Attribute     Estimated    Actual
Front Edge          length        0.5920 m     0.6096 m
                    tilt          -0.6567°     0°
                    down range    -0.3085 m    -0.3048 m
                    cross range   -0.0014 m    0.0000 m
                    alpha         0            0
Back Left Corner    down range    0.3048 m     0.3048 m
                    cross range   -0.3157 m    -0.3048 m
                    alpha         -1           -1
Back Right Corner   down range    0.2971 m     0.3048 m
                    cross range   0.3216 m     0.3048 m
                    alpha         -1           -1

The actual values are based on the assumption that the plate is exactly
two foot square and is perfectly aligned during radar measurements so
that zero degrees corresponds to broadside to an edge. The estimated
tilt angle is approximately -0.6 degrees, which is an indication that
the plate was not exactly aligned with 0 degrees broadside to the
radar. Figure 5 shows the amplitude of the scattering from the plate as
a function of angle. Note that the peaks are not at zero degrees and 90
degrees, as we would expect for a perfectly aligned target. The
misalignment of the target also contributes to a small amount of error
in the expected locations of the three scattering centers.


Figure 5: Magnitude of plate scattering vs. aspect angle, indicating
target misalignment

Figure 4: Image and estimates for plate example


The  algorithm  successfully  compresses  the  mea¬ 
sured  data  set  into  a  small  table  of  numbers  de¬ 
scribing  three  distinctive  features  of  the  plate. 
Over  97  percent  of  the  measured  energy  is  mod¬ 
eled  by  the  three  estimated  scattering  centers. 
Thus,  the  scattering  model  provides  a  70:1  lossy 
compression  of  the  original  frequency-aspect 
data  with  a  mean-squared  image  error  (MSE) 
of  three  percent.  The  error  in  the  estimated 
location  of  the  individual  scattering  centers  is 
small,  and  in  each  case  is  less  than  1/10  the 
Fourier resolution. The geometric type ($\alpha$) estimates correctly
identify the edge specular and corner diffraction scattering behaviors.
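The statement about location accuracy can be verified directly from the entries of Table 3. The short sketch below recomputes each down range and cross range error and compares it with one tenth of the 19.2 cm Fourier resolution. The data values are copied from the table; the variable and function names are chosen for this example.

```python
FOURIER_RESOLUTION_M = 0.192

# (scatterer, attribute, estimated [m], actual [m]) copied from Table 3
TABLE3_LOCATIONS = [
    ("Front Edge", "down range", -0.3085, -0.3048),
    ("Front Edge", "cross range", -0.0014, 0.0000),
    ("Back Left Corner", "down range", 0.3048, 0.3048),
    ("Back Left Corner", "cross range", -0.3157, -0.3048),
    ("Back Right Corner", "down range", 0.2971, 0.3048),
    ("Back Right Corner", "cross range", 0.3216, 0.3048),
]

def location_errors(rows):
    """Absolute location error in meters for each table row."""
    return [(name, attr, abs(est - act)) for name, attr, est, act in rows]

def worst_error(rows):
    return max(err for _, _, err in location_errors(rows))

if __name__ == "__main__":
    for name, attr, err in location_errors(TABLE3_LOCATIONS):
        print(f"{name:18s} {attr:12s} {100 * err:5.2f} cm")
    print("limit:", 100 * FOURIER_RESOLUTION_M / 10, "cm")
```

The largest error is the cross range position of the back right corner, which still falls below the one-tenth-resolution threshold.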

Next we consider a more complicated target, namely a scale model of an
M35 truck. Stepped frequency measurements of the 1:16 brass scale model
truck were collected in the ESL compact range. As in the first example,
we analyze data from 9.5-10.5 GHz and ±3 degrees (from normally
incident on the back of the truck). Figure 6 shows the image of the
truck for this data set. We segment the single most energetic
scattering mechanism in the image and fit the line mechanism
($L \neq 0$) model to that image chip. Table 4 shows the estimated
parameters for the model truck. The scattering center is estimated to
have an alpha value of 1, which implies a specular surface, as is
indeed present on the back of the model truck. The simulated image
generated with the estimated parameters for the single mechanism
represents the entire image with only six percent MSE at this frequency
and angle span. Thus the model provides a 189:1 lossy compression of
the original frequency-aspect data (4600:1 compression of the SAR
image) with over 0.94 correlation to the original image.
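One accounting that reproduces the quoted frequency-aspect compression ratios is sketched below. It assumes that the stepped measurements include both end points (9.5 to 10.5 GHz in 20 MHz steps, ±3 degrees in 0.5 degree steps), that the truck data use the same step sizes as the plate data, that each complex sample counts as two real numbers, and that a line mechanism carries seven real parameters and a point mechanism six, following the parameter list in Section 4. These counting conventions are assumptions; the text does not state how the ratios were obtained.

```python
def inclusive_steps(start, stop, step):
    """Number of samples from start to stop inclusive at the given step."""
    return int(round((stop - start) / step)) + 1

N_FREQ = inclusive_steps(9.5e9, 10.5e9, 20e6)    # frequency samples
N_ANGLE = inclusive_steps(-3.0, 3.0, 0.5)        # aspect samples
REAL_SAMPLES = 2 * N_FREQ * N_ANGLE              # two reals per complex sample

PARAMS_LINE = 7    # amplitude, phase, r, L, tilt, x, y (gamma fixed at 0)
PARAMS_POINT = 6   # amplitude, phase, r, gamma, x, y (L = 0, no tilt)

def compression_ratio(n_line, n_point):
    """Real data samples divided by real model parameters."""
    return REAL_SAMPLES / (n_line * PARAMS_LINE + n_point * PARAMS_POINT)

if __name__ == "__main__":
    print("plate:", compression_ratio(1, 2))   # one edge, two corners
    print("truck:", compression_ratio(1, 0))   # single line mechanism
```

Under these assumptions the plate example gives about 70:1 and the truck example about 189:1, matching the ratios quoted in the text.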
Figure 6: Image of scale model M35 truck

Table 4: Estimated scattering parameters for truck example

Scatterer        Attribute     Estimate
Back of Truck    Length        0.1595 m
                 Tilt          -0.4671°
                 Down Range    -0.1502 m
                 Cross Range   -0.0033 m
                 Alpha         1

7  Conclusion 

We present a GTD-based parametric scattering model for the extraction
of scattering centers from radar data measured as a function of
frequency and aspect angle. The scattering model balances physical
fidelity with simplicity in functional form to yield both smaller
modeling error and a richer description of scattering behavior when
compared to either Fourier imaging or point scattering models. Data
analysis using the proposed model has application to feature extraction
for target identification, SAR data compression, and scattering
studies. The model is developed in the frequency-aspect domain,
motivated by GTD-based and physical optics scattering principles. We
present an image domain estimation procedure for the model parameters,
and thereby gain the benefit of both clutter suppression and
computational savings. We derive Cramer-Rao bounds as tools for
predicting uncertainty in estimated parameters. The scattering model
and the image domain estimation algorithm are validated with two
measured data examples.
Abstract

The  correctness  of  results  of  structural  object  recog¬ 
nition  approaches  largely  depends  on  the  reliability 
of  the  features  extracted  from  the  image  data.  How¬ 
ever,  this  cannot  be  satisfied  in  many  practical  situ¬ 
ations  where  the  applications  require  robust  recogni¬ 
tion  during  day/night  under  high  clutter.  Stochas¬ 
tic  models  provide  some  attractive  features  for  pat¬ 
tern  matching  and  recognition  under  partial  occlu¬ 
sion  and  noise.  In  this  paper,  we  present  a  hidden 
Markov  modeling  (HMM)  based  approach  for  rec¬ 
ognizing  objects  in  synthetic  aperture  radar  (SAR) 
images.  We  develop  multiple  models  for  a  given 
SAR  image  of  an  object  and  integrate  these  mod¬ 
els  synergistically  using  their  probabilistic  estimates 
for  recognition.  The  models  are  based  on  sequen- 
tialization  of  scattering  centers  extracted  from  SAR 
images.  Experimental  results  are  presented  using 
99,000  training  samples  and  81,000  testing  samples 
for  5  classes.  We  achieved  better  than  87%  correct 
recognition  performance  when  the  objects  are  up  to 
30%  occluded. 

1  Introduction 

One  of  the  critical  problems  for  object  recognition  is 
that  the  recognition  process  has  to  be  able  to  handle 
partial  occlusion  of  the  object  and  spurious  or  noisy 
data.  In  most  of  the  object  recognition  approaches, 
the  spatial  arrangement  of  structural  information  of 
the  object  is  the  central  part  that  offers  the  most 
important  information.  Under  partial  occlusion  sit¬ 
uations  the  recognition  process  must  be  able  to  work 
with  only  portions  of  the  correct  spatial  information. 
Rigid  template  matching  and  shape-based  recogni¬ 
tion  approaches  depend  on  good  prior  segmentation 
results. But the structural primitives (e.g., line segments, point-like
features, etc.) extracted from occluded and noisy images may not have
sufficient reliability, which will directly undermine the performance
of those recognition approaches.

*This work is supported by DARPA grant MDA972-93-1-0010. The contents
and information do not necessarily reflect the position or the policy
of the U.S. Government.

We  want  to  suggest  an  object  recognition  mecha¬ 
nism  that  effectively  makes  use  of  all  available  struc¬ 
tural  information.  Based  on  the  nature  of  the  prob¬ 
lems  caused  by  occlusion  and  noise,  we  view  the 
spatial  arrangement  of  structural  information  as  a 
whole  rather  than  view  the  spatial  primitives  indi¬ 
vidually.  Because  of  its  stochastic  nature,  a  hidden 
Markov  model  (HMM)  is  quite  suitable  for  charac¬ 
terizing  patterns.  Its  nondeterministic  model  struc¬ 
ture  makes  it  capable  of  collecting  useful  informa¬ 
tion  from  distorted  or  partially  unreliable  patterns. 
Many  successful  applications  of  HMM  in  speech 
recognition  [1,  2,  3]  and  character  recognition  [4,  5] 
attest  to  its  usefulness.  Thus,  it  is  potentially  an 
effective  tool  to  recognize  objects  with  partial  occlu¬ 
sion  and  noise. 

However, traditional HMMs are limited in that they are basically
one-dimensional models. The key question is how to apply this approach
appropriately to two-dimensional image problems, which has been largely
an unsolved problem. In this paper we use features based on the image
formation process to encode the 2-D image into 1-D sequences. We use
information both from the relative positions of the scattering centers
and from their relative magnitudes in SAR images [6]. We address the
fundamental issues of building object models and using them for robust
recognition of objects in SAR images.

1.1  Overview  of  the  approach 

Figure 1 provides an overview of the HMM-based approach for recognition
of occluded objects in SAR imagery. During an off-line phase,
scattering centers are extracted from SAR images by finding local
maxima of intensity. Both locations and magnitudes of these peak
features are used in the approach. These features are viewed as
emitting patterns of some hidden stochastic process. Multiple
observation sequences based on both the relative geometry and relative
amplitude of the SAR return signal (obtained as a result of the physics
of the SAR image formation process) are used to build the bank of
stochastic models to provide robust recognition in the presence of
severe occlusion and unstable features caused by scintillation
phenomena (where some of the features may appear/disappear at random in
an image). At the end of the off-line phase, hidden Markov recognition
models for various objects and azimuths are obtained. Similar to the
off-line phase, during the on-line phase features are extracted from
SAR images, and observation sequences based on these features are
matched by the HMM forward process with the stored models obtained
previously. A maximum likelihood decision is made on the classification
results. The results obtained from multiple models are then combined in
a voting scheme that uses both the object and azimuth label and its
probability of classification. This produces a rank-ordered list of
classifications of the test image and associated confidences.

Figure 1: The HMM-based approach for recognition of occluded objects.

1.2  Related  work  and  our  contribution 

There is no published work on object recognition using HMM models.
Fielding and Ruck [7] have used HMM models for spatio-temporal pattern
recognition to classify moving objects in image sequences. Rao and
Mersereau [8] have attempted to merge HMM and deformable template
approaches for image segmentation. Template matching [9] and major axis
based approaches [10] have been used to recognize and index objects in
SAR images; however, they are not suitable for recognizing occluded
objects. Recently, invariant histograms in conjunction with template
matching have also been used to recognize occluded objects in SAR
images [11].

The  original  contributions  of  this  paper  are: 

•  Hidden  Markov  modeling  approach  commonly 
used  for  recognizing  1-D  speech  signals  is  ap¬ 
plied  in  a  novel  manner  to  2-D  SAR  images  to 
solve  the  occluded  object  recognition  problem. 

•  Multiple  models  derived  from  various  observa¬ 
tion  sequences,  based  on  both  the  geometry  and 
signal  amplitude  are  used  to  capture  the  unique 
characteristics  of  patterns  to  recognize  objects. 

•  Unlike  most  of  the  work  for  model  building 
in  computer  vision,  our  recognition  models  us¬ 
ing  hidden  Markov  modeling  concept  are  based 
on  the  peculiar  characteristics  of  SAR  images 
where  the  number  of  models  used  for  recogni¬ 
tion  is  scientifically  justified  by  the  quantifica¬ 
tion  of  the  azimuthal  variance  in  SAR  images. 

•  Extensive amounts of data (99,000 training samples and 81,000
testing samples obtained from 1800 images generated by the well known
XPATCH SAR simulator [12], which uses 3-D CAD models of objects) are
used to test the approach for recognition of objects under various
amounts of occlusion (10-50%), and good recognition performance is
obtained.

2  Hidden  Markov  Modeling 
Approach 

It  is  well  known  that  HMM  can  model  speech  signals 
well  [1,  2,  3].  It  is  a  model  used  to  describe  a  doubly 
stochastic  process  which  has  a  set  of  states,  a  set  of 
output  symbols  and  a  set  of  transitions.  Each  tran¬ 
sition  is  from  state  to  state  and  associated  with  it 
are  a  probability  and  an  output  symbol.  The  word 
‘hidden’  means  that  although  we  observe  an  out¬ 
put  symbol,  we  cannot  determine  which  transition 
has  actually  taken  place.  At  each  time  step  t,  the 
state  of  the  HMM  will  change  according  to  a  tran¬ 
sition  probability  distribution  which  depends  on  the 
previous  state  and  an  observation  yt  is  produced  ac¬ 
cording  to  a  probability  distribution  which  depends 
on  the  current  state. 

Formally, a HMM is defined as a triple $\lambda = (A, B, \pi)$, where
$a_{ij}$ is the probability that state i transits to state j,
$b_{ij}(k)$ is the probability that we observe symbol k in a transition
from state i to state j, and $\pi_i$ is the probability of i being the
initial state. Figure 2 shows an example of an N-state HMM.

N: the number of states.
M: the number of distinct observable symbols.
A: $a_{ij}$ is the probability that state i will transit to state j.
B: $b_{ij}(k)$ is the probability that symbol k will be observed when
there is a transition from state i to state j.
$\pi$: $\pi_i$ is the probability that state i is the initial state.

Figure 2: An N-state forward-type HMM


Recognition Problem (Forward Procedure): The HMM provides us a useful
mechanism to solve the problems we face for robust object recognition.
Given a model and a sequence of observations, the probability that the
observed sequence was produced by the model can be computed by the
forward procedure [13]. Suppose we have a HMM $\lambda = (A, B, \pi)$
and an observation sequence $y_1^T$. We define $\alpha_i(t)$ as the
probability that the Markov process is in state $i$, having generated
$y_1^t$.

$\alpha_i(t) = 0$, when $t = 0$ and $i$ is not an initial state.
$\alpha_i(t) = 1$, when $t = 0$ and $i$ is an initial state.    (1)
$\alpha_i(t) = \sum_j [\alpha_j(t-1)\, a_{ji}\, b_{ji}(y_t)]$, when $t > 0$.

The probability that the HMM stopped at the final state and generated
$y_1^T$ is $\alpha_{S_F}(T)$. After initialization of $\alpha$, we
compute it inductively. At each step the previously computed $\alpha$
is used, until $t$ reaches $T$. $\alpha_{S_F}(T)$ is the sum of the
probabilities of all paths of length $T$.

Usually, $\alpha$ becomes too small to be represented in a computer
after several iterations, so we take the logarithm of the $\alpha$
value in the computation.

Training Problem (Baum-Welch Algorithm): To build a HMM is actually an
optimization of the model parameters so that the model can describe
the observation better. This is a problem of training. The Baum-Welch
re-estimation algorithm is used to calculate the maximum likelihood
model. But before we use the Baum-Welch algorithm, we must introduce
the counterpart of $\alpha_i(t)$: $\beta_i(t)$, which is the
probability that the Markov process is in state $i$ and will generate
$y_{t+1}^T$.

$\beta_i(t) = 0$, when $t = T$ and $i$ is not a final state.
$\beta_i(t) = 1$, when $t = T$ and $i$ is a final state.    (2)
$\beta_i(t) = \sum_j [a_{ij}\, b_{ij}(y_{t+1})\, \beta_j(t+1)]$, when $0 < t < T$.

The probability of being in state $i$ at time $t$ and state $j$ at
time $t + 1$, given the observation sequence $y_1^T$ and the model
$\lambda$, is defined as follows:

$$\gamma_{ij}(t) = P(X_t = i, X_{t+1} = j \mid y_1^T) = \frac{\alpha_i(t-1)\, a_{ij}\, b_{ij}(y_t)\, \beta_j(t)}{\alpha_{S_F}(T)} \qquad (3)$$

Now the expected number of transitions from state $i$ to state $j$
given $y_1^T$ at any time is simply $\sum_{t=1}^{T} \gamma_{ij}(t)$,
and the expected number of transitions from state $i$ to any state at
any time is $\sum_{t=1}^{T} \sum_k \gamma_{ik}(t)$. Then, given some
initial parameters, we could recompute $\bar{a}_{ij}$, the probability
of taking the transition from state $i$ to state $j$, as:

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_k \gamma_{ik}(t)} \qquad (4)$$

Similarly, $b_{ij}(k)$ can be re-estimated as the ratio between the
frequency that symbol $k$ is emitted and the frequency that any symbol
is emitted:

$$\bar{b}_{ij}(k) = \frac{\sum_{t:\, y_t = k} \gamma_{ij}(t)}{\sum_{t=1}^{T} \gamma_{ij}(t)} \qquad (5)$$

It can be proved that the above equations are guaranteed to increase
$\alpha_{S_F}(T)$ until a critical point is reached, after which the
re-estimate will remain the same. In practice, we set a threshold as
the ending condition for re-estimation.

So the whole process of training a HMM is as follows:

1. Initially, we have only an observation sequence $y_1^T$ and blindly
set $(A, B, \pi)$.

2. Use $y_1^T$ and $(A, B, \pi)$ to compute $\alpha$ and $\beta$
(equations 1, 2).

3. Use $\alpha$ and $\beta$ to compute $\gamma$ (equation 3).

4. Use $y_1^T$, $(A, B, \pi)$, $\alpha$, $\beta$ and $\gamma$ to
compute $A$ and $B$ (equations 4, 5). Go to step 2.

A HMM is able to handle pattern distortions and the uncertainty of the
locally observed signals, because of its nondeterministic nature.
However, a HMM is primarily suited for sequential, one-dimensional
patterns, and it is not obvious how a HMM can be used on 2-D patterns
in object recognition. The basic ideas to apply a HMM for our purpose
are (a) training the HMM $\lambda$ by samples of SAR images of a
certain object, and (b) recognizing an unknown object in a given SAR
image. These two problems are addressed in the following. The key
questions are what we shall use as observation data and how we get the
observation sequences.

3 Hidden Markov Models for SAR
Object Recognition

3.1 Extraction of Scattering Centers


Scattering centers (location and magnitude) extracted from SAR images
are used to train and test models for recognition. We consider a pixel
as a scattering center if the magnitude of SAR return at this pixel is
larger than all its eight neighbors. Figure 3 shows some examples of
scattering centers extracted from SAR images (6" resolution) of
various objects at 15° depression angle and azimuths at 0°, 60°, and
90°.

Figure 3: Examples of scattering centers (white dots) extracted from
SAR images at azimuths 0°, 60°, 90°. (a) Fred tank, (b) SCUD launcher
with missile down, (c) T72 tank, (d) T80 tank, (e) M1A1 tank.
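The local-maximum rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `scattering_centers`, the list-of-lists image layout, and the decision to skip border pixels (which lack eight neighbors) are assumptions.

```python
def scattering_centers(image, top_n=20):
    """Return up to top_n (row, col, magnitude) local maxima, brightest first.

    A pixel qualifies when its magnitude is strictly larger than all
    eight neighbors. Border pixels are skipped (an assumption).
    """
    rows, cols = len(image), len(image[0])
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    centers = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            value = image[r][c]
            if all(value > image[r + dr][c + dc] for dr, dc in offsets):
                centers.append((r, c, value))
    # Sort by magnitude so that the strongest centers come first.
    centers.sort(key=lambda item: item[2], reverse=True)
    return centers[:top_n]


# Toy 5 x 5 magnitude image with two isolated peaks.
toy_image = [[0.0] * 5 for _ in range(5)]
toy_image[1][1] = 5.0
toy_image[3][3] = 9.0
toy_image[3][2] = 2.0   # not a peak: its neighbor at (3, 3) is larger
toy_centers = scattering_centers(toy_image)
```

Keeping the magnitude alongside the location lets the same list feed both the amplitude-based and the distance-based sequences.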

3.2 Rotation Variance of Scattering Centers
and Representation of 3-D Objects

Unlike  the  visible  images,  SAR  images  are  extremely 
sensitive  to  slight  changes  in  viewpoint  (azimuth  and 
depression  angle)  and  are  not  affected  by  scale  [14]. 

We evaluate the characteristics of scattering centers to find out what
kind of location invariance exists among scattering centers. Figure
4(a) shows the rotation invariance for the T72 tank. The data is
obtained by rotating the image at azimuth $i$° (for a fixed depression
angle) by $x$° ($x$ from 1 to 10), and comparing the rotated image
with the image at $(i + x)$° to see how many scattering centers do not
change their location. Since the object chip is 256 x 256 pixels, we
rotate the image with respect to the center point (127.5, 127.5). The
distance measurement criteria “exact match” and “within one pixel” are
defined in the following:

• $x_r$ exactly matches $x$: if $\max(|x - x_r|, |y - y_r|) <$ [...] pixel
• $x_r$ and $x$ are within one pixel: if $\max(|x - x_r|, |y - y_r|) <$ [...] pixel

Figure 4: (a) T72 tank rotational invariance. (b) T72 tank rotational
invariance with 1° angular span.

Figure  4(a)  shows  the  average  result  for  images  at  all 
the  360  azimuth  angles.  The  top  50  scattering  cen¬ 
ters  are  used  for  each  image.  Figure  4(b)  gives  the 
percentage  of  scattering  center  locations  unchanged 
vs.  azimuth  angle  with  1°  angular  span  for  the  exact 
match  and  within  one  pixel  match.  These  results 
show  that  scattering  centers  for  SAR  images  vary 
greatly  with  relatively  small  changes  of  azimuth  an¬ 
gles.  As  a  result  we  represent  an  object  at  a  given 
depression  angle  by  360  azimuths  taken  in  steps  of 
1°. 
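The rotation test described in this subsection can be sketched as below. It is only an illustration under stated assumptions: the helper names `rotate` and `fraction_unchanged` are hypothetical, the rotation direction is chosen arbitrarily, and the matching tolerance `tol` is left as a free parameter rather than taken from the text.

```python
import math


def rotate(points, degrees, center=(127.5, 127.5)):
    """Rotate (x, y) points about the chip center by the given angle."""
    theta = math.radians(degrees)
    cx, cy = center
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(cx + (x - cx) * cos_t - (y - cy) * sin_t,
             cy + (x - cx) * sin_t + (y - cy) * cos_t)
            for x, y in points]


def fraction_unchanged(rotated, reference, tol):
    """Fraction of rotated centers with a reference center closer than tol.

    Closeness uses the maximum of the absolute x and y differences.
    """
    def close(p, q):
        return max(abs(p[0] - q[0]), abs(p[1] - q[1])) < tol

    hits = sum(1 for p in rotated if any(close(p, q) for q in reference))
    return hits / len(rotated)


# Toy check: two centers rotated by 90 degrees about (127.5, 127.5).
original = [(127.5, 127.5), (137.5, 127.5)]
turned = rotate(original, 90)
```

Averaging `fraction_unchanged` over all azimuths for a fixed angular step would give one point of a curve of the kind plotted in Figure 4(a).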

3.3  Extraction  of  Observation  Sequences 

After the scattering centers are extracted, we need to encode the data
into a 1-D sequence as the input to a HMM-based recognition process.
This encoding is one of the key factors affecting the performance of a
HMM modeling approach for object recognition. There are many ways to
choose observation sequences, but we want to use information from both
the magnitude and the relative spatial location of the scattering
centers extracted from a SAR image. Also, the sequentialization method
should not be affected by distortion, noise, or partial occlusion and
should be able to represent the image efficiently.

Figure 5: Example of an observation sequence superimposed on an image
of T72 tank.

Based  on  the  above  considerations,  we  employ  two 
approaches  to  obtain  the  sequences. 

• Sequences based on relative amplitudes:

$O_1 = \{Magnitude_1, Magnitude_2, \ldots, Magnitude_n\}$

• Sequences based on geometrical relationship:

$O_2 = \{d(1,2), d(2,3), \ldots, d(n,1)\}$ (length $n$)

$O_3 = \{d(1,2), d(1,3), \ldots, d(1,n)\}$ (length $n - 1$)

$O_4 = \{d(2,1), d(2,3), \ldots, d(2,n)\}$ (length $n - 1$)

$O_5 = \{d(3,1), d(3,2), \ldots, d(3,n)\}$ (length $n - 1$)

where $Magnitude_i$ is the amplitude of the $i$th scattering center
and $d(i,j)$ is the Euclidean distance between scattering centers $i$
and $j$. Figure 5 gives an example to illustrate how we get the
sequences. Sequence $O_1$ is obtained by sorting the scattering
centers by their magnitude. We label the scattering centers 1 through
$n$ in descending order. So in this approach, we do not use the
location information and thus can avoid the instability caused by the
error in localization of scattering centers. Sequences $O_2$ through
$O_5$ are obtained based on the relative locations of the scattering
centers. In the experiments described in section 4, we only consider
the top 20 scattering centers. This is because we expect that the
scattering centers with larger magnitude are relatively more stable
than the weaker ones.
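As a minimal sketch of the five sequence definitions above, the following code builds $O_1$ through $O_5$ from a list of scattering centers. The function name `observation_sequences` and the `(row, col, magnitude)` tuple layout are assumptions made for illustration; at least three centers are required.

```python
import math


def observation_sequences(centers, top_n=20):
    """Build O1..O5 from (row, col, magnitude) scattering centers."""
    # Label centers 1..n in descending order of magnitude.
    ranked = sorted(centers, key=lambda c: c[2], reverse=True)[:top_n]
    n = len(ranked)

    def d(i, j):
        """Euclidean distance between centers with 1-based labels i and j."""
        (r1, c1, _), (r2, c2, _) = ranked[i - 1], ranked[j - 1]
        return math.hypot(r1 - r2, c1 - c2)

    def star(ref):
        """Distances from center `ref` to every other center (length n - 1)."""
        return [d(ref, j) for j in range(1, n + 1) if j != ref]

    return {
        "O1": [c[2] for c in ranked],                       # magnitudes
        "O2": [d(i, i % n + 1) for i in range(1, n + 1)],   # chain closing at d(n, 1)
        "O3": star(1),
        "O4": star(2),
        "O5": star(3),
    }


# Toy example: four centers on the corners of a 4 x 3 rectangle.
toy = [(0, 0, 10.0), (0, 3, 8.0), (4, 3, 6.0), (4, 0, 4.0)]
seqs = observation_sequences(toy)
```

Because $O_3$, $O_4$, and $O_5$ are anchored at the three strongest centers, losing one of those centers to occlusion affects only one of the three sequences, which is one way to read the motivation for using several sequences.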

Since we use discrete HMMs, each element in the sequence must be
converted to an observation symbol, that is, a label from 1 to $K$
drawn from the set of symbols that can be observed for a HMM. We use
the K-means algorithm [15] to classify the magnitude values (or
distance values) of all the scattering centers in the database into
$K$ classes. Once we know to which class each element of a sequence
belongs, we label the element with the label of its class. Thus, each
magnitude value (or distance value) is replaced by a label between 1
and $K$ indicating the group into which it falls, and for a given
sequence we finally obtain a sequence of observation symbols.
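The quantization step can be sketched with a small one-dimensional K-means. This is an illustration only: the names `kmeans_1d` and `to_symbols` are hypothetical, and the deterministic initialization from evenly spaced order statistics is an assumption, since the text does not say how the clusters are initialized.

```python
def kmeans_1d(values, k, iterations=50):
    """Cluster scalar values into k groups; return centroids in ascending order."""
    ordered = sorted(values)
    # Assumed initialization: evenly spaced order statistics of the data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Empty clusters keep their previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def to_symbols(sequence, centroids):
    """Replace each value by the 1-based label of its nearest centroid."""
    return [1 + min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in sequence]


# Toy database of magnitudes forming three obvious groups.
database = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
codebook = kmeans_1d(database, k=3)
symbols = to_symbols([0.9, 5.1, 9.3, 1.1], codebook)
```

The codebook is computed once over the whole database, so training and testing sequences are mapped to symbols with the same class boundaries.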


3.4  Off-line  Training  Phase 

The  procedure  for  building  the  model  base  is  de¬ 
scribed  as  follows: 

1.  Loop  (for  a  given  depression  angle)  lines  2-4  for 
each  object  and  each  azimuth  angle. 

2.  Generate  images  which  simulate  occlusion  with 
scattering  centers  occluded  from  different  direc¬ 
tions  (see  section  4.1). 

3.  Loop  line  4  for  each  image  generated  by  line  2. 

4. Use the Baum-Welch algorithm to re-estimate the HMM parameters.
(Exit the 3-4 loop when there is no further change in parameter
values.)
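One Baum-Welch re-estimation pass, as used in step 4, can be sketched as follows for a discrete HMM whose symbols are emitted on transitions. This is a hedged illustration, not the authors' code: it assumes a single initial state (index 0), a single final state (the last index), one training sequence, and unscaled probabilities, which is adequate only for short sequences. The names `forward`, `backward`, and `reestimate` are hypothetical.

```python
def forward(A, B, y, init=0):
    """alpha[t][i]: probability of being in state i after emitting y[:t]."""
    n = len(A)
    alpha = [[1.0 if i == init else 0.0 for i in range(n)]]
    for t in range(1, len(y) + 1):
        prev = alpha[-1]
        alpha.append([sum(prev[j] * A[j][i] * B[j][i][y[t - 1]] for j in range(n))
                      for i in range(n)])
    return alpha


def backward(A, B, y, final):
    """beta[t][i]: probability of emitting y[t:] from state i, ending in final."""
    n, T = len(A), len(y)
    beta = [None] * (T + 1)
    beta[T] = [1.0 if i == final else 0.0 for i in range(n)]
    for t in range(T - 1, -1, -1):
        beta[t] = [sum(A[i][j] * B[i][j][y[t]] * beta[t + 1][j] for j in range(n))
                   for i in range(n)]
    return beta


def reestimate(A, B, y, init=0, final=None):
    """One pass of transition and emission re-estimation; returns (A', B', P(y))."""
    n, m, T = len(A), len(B[0][0]), len(y)
    final = n - 1 if final is None else final
    alpha, beta = forward(A, B, y, init), backward(A, B, y, final)
    likelihood = alpha[T][final]
    # gamma[t-1][i][j]: posterior probability of taking transition i -> j at step t.
    gamma = [[[alpha[t - 1][i] * A[i][j] * B[i][j][y[t - 1]] * beta[t][j] / likelihood
               for j in range(n)] for i in range(n)] for t in range(1, T + 1)]
    new_A = [row[:] for row in A]
    new_B = [[sym[:] for sym in row] for row in B]
    for i in range(n):
        out_i = sum(g[i][j] for g in gamma for j in range(n))
        for j in range(n):
            through_ij = sum(g[i][j] for g in gamma)
            if out_i > 0:
                new_A[i][j] = through_ij / out_i
            if through_ij > 0:
                for k in range(m):
                    emitted_k = sum(g[i][j] for g, s in zip(gamma, y) if s == k)
                    new_B[i][j][k] = emitted_k / through_ij
    return new_A, new_B, likelihood


# Toy 3-state left-to-right model with 2 symbols and uniform emissions.
A0 = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
B0 = [[[0.5, 0.5] for _ in range(3)] for _ in range(3)]
y = [0, 0, 1, 1, 0, 1]
A1, B1, before = reestimate(A0, B0, y)
_, _, after = reestimate(A1, B1, y)
```

Transitions that are never used keep their previous parameters, which avoids division by zero. In practice the pass would be repeated until the change in likelihood falls below a threshold, and the log-domain or scaled forms of the recursions would replace the raw products.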

3.5  On-line  Recognition  Phase 

The  recognition  procedure  is  described  as  follows: 

1.  Loop  lines  2-3  for  all  the  testing  observation  se¬ 
quences. 

2.  Loop  line  3  for  all  the  models  in  the  model  base. 

3. Feed the observation sequence into the model $(A, B, \pi)$. Use the
Forward algorithm to compute the probability that this sequence is
produced by this model.

4.  The  model  with  maximum  probability  of  an  ob¬ 
servation  sequence  is  selected  as  the  best  match. 
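The recognition loop above can be sketched as a log-domain forward pass followed by an arg-max over the model base. This is a minimal illustration under assumptions: each model is an `(A, B)` pair with symbols emitted on transitions, state 0 is the initial state, the last state is the final state, and the names `log_forward` and `recognize` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    peak = max(values)
    if peak == NEG_INF:
        return NEG_INF
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def log_forward(A, B, y, init=0, final=None):
    """Log-probability that the model emits y and stops in the final state."""
    n = len(A)
    final = n - 1 if final is None else final
    log_alpha = [0.0 if i == init else NEG_INF for i in range(n)]
    for symbol in y:
        log_alpha = [log_sum_exp([log_alpha[j] + safe_log(A[j][i])
                                  + safe_log(B[j][i][symbol]) for j in range(n)])
                     for i in range(n)]
    return log_alpha[final]


def recognize(model_bank, y):
    """Score y against every model; return the best model name and all scores."""
    scores = {name: log_forward(A, B, y) for name, (A, B) in model_bank.items()}
    return max(scores, key=scores.get), scores


# Toy model base: two 2-state models that differ only in their emissions.
A = [[0.5, 0.5], [0.0, 1.0]]
bank = {
    "mostly0": (A, [[[0.9, 0.1] for _ in range(2)] for _ in range(2)]),
    "mostly1": (A, [[[0.1, 0.9] for _ in range(2)] for _ in range(2)]),
}
best, scores = recognize(bank, [0, 0, 0, 1, 0])
```

Working with log-probabilities avoids the underflow mentioned in section 2; sorting `scores` instead of taking only the maximum would give a ranked candidate list for indexing.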

4  Experiments 

4.1  Data 

Using the well known XPATCH [12] SAR simulator, we generate one set of
SAR images of 5 objects (Fred tank, SCUD missile launcher, T72 tank,
T80 tank, and M1A1 tank, shown in Figure 6) at 15° depression angle,
at each of the azimuth angles from 0° to 359°. We extract the 20
scattering centers (local maxima) with the largest magnitudes. Since
we want to test the performance of our approach under partial
occlusion and spurious data, we simulate realistic occlusion
situations and generate images for training and testing.

Simulating occlusion: We consider occlusion from 9 possible
directions, as shown in Figure 7. Occluded scattering centers are not
available; moreover, we add some spurious data into the image. For
instance, 20 scattering centers are shown in each image of Figure 7.
They are obtained by removing 4 scattering centers from one particular
direction (simulated occlusion) and adding 4 spurious scattering
centers into the image. The spurious scattering centers are added
based on the following rules:
Figure 6: Targets. (c) T72 tank, (d) T80 tank, (e) M1A1 tank.

Figure 7: Scattering centers of the T72 tank at azimuth 0°; part of
the scattering centers are occluded from a particular direction (0-8,
left to right, top to bottom).


•  The  location  of  the  scattering  center  is  gener¬ 
ated  as  a  pair  of  random  numbers. 

•  The  magnitude  of  the  scattering  center  depends 
on  a  random  number  r  between  1  and  50.  If  r 
is  between  1  and  20,  we  use  the  magnitude  of 
the  rth  brightest  image  scattering  center  as  the 
magnitude  of  the  spurious  center.  Otherwise, 
we  choose  the  magnitude  of  the  21st  brightest 
scattering  center  if  it  was  not  already  assigned 
to  another  spurious  center.  If  it  was  already 
chosen,  we  will  select  the  magnitude  of  the  first 
unused  scattering  center  (the  22nd,  the  23rd, 
and  so  on). 
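The two rules above can be sketched as a small generator of spurious centers. This is an illustration under assumptions: the function name `spurious_centers` is hypothetical, locations are drawn uniformly over an assumed 256 x 256 chip, and the input list is assumed to hold the magnitudes of all local maxima in descending order with enough entries beyond the 20th.

```python
import random


def spurious_centers(sorted_magnitudes, count, size=256, rng=None):
    """Generate `count` spurious (row, col, magnitude) scattering centers."""
    rng = rng or random.Random()
    next_unused = 20          # 0-based index of the 21st brightest center
    result = []
    for _ in range(count):
        # Rule 1: the location is a pair of random numbers.
        row, col = rng.randrange(size), rng.randrange(size)
        # Rule 2: the magnitude depends on a random number r in 1..50.
        r = rng.randint(1, 50)
        if r <= 20:
            magnitude = sorted_magnitudes[r - 1]
        else:
            # Take the first not-yet-used center from the 21st onward.
            magnitude = sorted_magnitudes[next_unused]
            next_unused += 1
        result.append((row, col, magnitude))
    return result


# Toy run with 100 distinct magnitudes and a fixed seed for repeatability.
mags = list(range(100, 0, -1))
out = spurious_centers(mags, 4, rng=random.Random(0))
```

With r uniform on 1 to 50, a spurious center copies the magnitude of one of the 20 retained centers 40% of the time and otherwise takes a weaker, previously unused magnitude.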

Training  Data:  Based  on  the  method  of  simulating 
occlusion  described  above,  we  generate  90  images 
from  the  original  image  (10  samples  for  each  of  9 
directions)  at  5%  occlusion  and  another  90  images 
at  10%  occlusion.  Including  the  original  image,  we 
have  181  images  per  object  per  azimuth  angle  to 
train  multiple  HMM  models.  Then  we  have  a  to¬ 
tal  of  99,000  (5  objects,  360  azimuths,  55  occluded 
images)  samples  for  training. 

Testing  Data:  We  generate  one  image  with  o  scatter¬ 
ing  centers  occluded  (o  =  2, 4,  6,  8  or  10)  from  direc¬ 
tion  d  (d  =  0, 1, ...,  8)  per  azimuth  angle  per  object. 
So  there  are  1800  images  (5  objects  x  360  degrees) 
generated  for  testing  of  occlusion  with  o  scattering 
centers occluded from direction d. Thus, we have a total of 81,000 (5
objects, 360 azimuths, 5 different occlusions 10%-50%, and 9
directions) samples for testing.


4.2  Training  -  Building  Bank  of  HMM 
Models  for  Recognition 


We performed experiments to choose the optimum number of states and
number of symbols of the HMM. We use data from 5 azimuth angles of
five objects (Fred tank, SCUD missile launcher, T80 tank, T72 tank,
and M1A1 tank). The results are shown in Table 1.

We  find  that  with  the  increase  in  the  number  of 
states  and  symbols,  recognition  performance  in¬ 
creases.  Considering  both  the  recognition  perfor¬ 
mance  and  the  computation  cost,  we  choose  8  states 
and  32  symbols  as  the  optimal  number  of  states  and 
symbols  for  our  HMM  models.  Figure  8  illustrates 
example  parameters  of  a  5  state,  4  symbol  HMM. 

We have 1800 (= 360 x 5) HMM models. Further, since we have defined
five kinds of observation sequences for each image ($O_1, O_2, O_3,
O_4, O_5$), we get models based on each kind of observation sequence.
Table 1: Recognition rate of HMM with different numbers of states and
symbols.

N  -  #  of  states. 

M  -  #  of  symbols. 

R  -  Recognition  rate  %  (top  answer  is  correct). 

I  -  Indexing  rate  %  (correct  answer  is  in  the  top  5). 
4.3 Testing Results

During  testing  phase,  each  of  the  81,000  testing  im¬ 
ages  is  tested  against  all  models  (1800  models:  5 
objects,  each  has  360  models  for  each  azimuth  an¬ 
gle).  If  the  model  with  the  maximum  probability  is 
the  model  which  produced  the  sequence,  we  count  it 
as  one  correct  recognition.  Otherwise,  we  count  it  as 
one  incorrect  recognition.  After  we  get  the  results 
of  scattering  centers  occluded  from  all  9  directions, 
we  average  the  result  and  associate  this  recognition 
performance  with  the  model. 

Figure 9 shows the testing results for each of the five kinds of
sequences: $O_1, O_2, \ldots, O_5$ (section 3.3).
The  top  curve,  a  dotted  line,  is  the  percentage  that 
the  test  case  object  and  pose  is  among  the  top  ten 
recognition  results,  and  the  lower  curve,  in  solid  line, 
indicates  the  percentage  that  the  recognition  result 
with  the  highest  probability  is  the  same  as  the  test 
case  object  and  pose. 

Integration of results from multiple sequences: Since not all models
based on various sequences for a particular object and azimuth will
provide optimal recognition performance under occlusion, noise, etc.,
we improve the recognition performance by combining the results
obtained from all five kinds of models. Before discussing the approach
for integration, we ask the following question: if one testing image
cannot be recognized correctly by models based on a particular
sequence, say $O_i$, will it be recognized correctly by models based
on other kinds of sequences?

Figure 8: An example: parameters of a 5-state, 4-symbol HMM. The
number on each edge represents the transition probability, and the
vector associated with each transition represents $b_{ij}(k)$. In our
case, we use HMMs with 8 states and 32 symbols.

From the testing results, we obtained Table 2, which shows how many
incorrect recognitions made using models based on sequence $O_2$ can
be correctly recognized (“captured”) by models based on other
sequences. We draw two curves (Figure 10(a)) to show the possible
“upper bound” and “lower bound” of the recognition rate we can achieve
based on the 5 kinds of models. We define the “upper bound” as the
highest possible recognition performance that can be achieved using
the 5 models ($O_1$ to $O_5$), considering only the top candidate for
recognition from each of the models. The top curve is obtained by
considering all 5 kinds of models: if any one of them correctly
recognizes the test data, we count it as a correct recognition. The
total number of errors corresponding to the “upper bound” is shown in
the 7th column of the table. The “lower bound”, or the bottom curve,
is the worst recognition result out of the five models.

We  have  developed  a  histogram-like  method  shown 
in  Figure  11  to  integrate  the  results  from  models 
based  on  5  different  sequences. 

1. For each test image, we collect the ten highest probabilities in
the testing results corresponding to each of the sequences $O_1, O_2,
\ldots, O_5$.

2. The ten probabilistic estimates corresponding to each of the
sequences are normalized. So we have 50 normalized numbers for each
test image.
Table 2: Testing results for occluded object recognition using 81,000
testing cases.

Figure 9: Recognition rate vs. percentage of occlusion for HMM models
based on (a) $O_1$, (b) $O_2$, (c) $O_3$, (d) $O_4$, and (e) $O_5$.

3. We draw a histogram of probability vs. object and pose (here we
combine object and pose as one parameter). This is possible because
each number corresponds to an object and a pose (the number is the
probability that the test image is the image of that object at that
pose).

4. If the object associated with the highest probability in the
histogram is the same as the ground truth, we count it as one correct
recognition.
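The four integration steps can be sketched as follows. This is an illustration under assumptions: the function name `integrate` and the dictionary layout are hypothetical, and since the text does not specify the normalization, the sketch simply rescales each sequence's top ten scores to sum to one.

```python
from collections import defaultdict


def integrate(results_per_sequence, top_k=10):
    """Combine per-sequence scores into one ranked list of (object, pose).

    results_per_sequence maps a sequence name (e.g. "O1") to a dict
    {(object, azimuth): probability} over the models tested.
    """
    histogram = defaultdict(float)
    for scores in results_per_sequence.values():
        # Step 1: keep the top_k candidates for this kind of sequence.
        best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        # Step 2: normalize them (assumed: rescale to sum to one).
        total = sum(p for _, p in best)
        # Step 3: accumulate into a histogram keyed by (object, pose).
        for candidate, p in best:
            histogram[candidate] += p / total if total > 0 else 0.0
    # Step 4: rank candidates by decreasing accumulated confidence.
    return sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)


# Toy example with three kinds of sequences and three candidates.
results = {
    "O1": {("T72", 0): 0.6, ("T80", 0): 0.4},
    "O2": {("T72", 0): 0.2, ("T80", 0): 0.7, ("M1A1", 5): 0.1},
    "O3": {("T72", 0): 0.9, ("T80", 0): 0.1},
}
ranking = integrate(results)
```

The first entry of the ranking plays the role of the recognition answer, and the first five entries play the role of the indexing answer.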

The second curve from the bottom in Figure 10(b) is the result. The
corresponding confusion matrix for various amounts of occlusion is
shown in Table 3. On average, we find 80.35% correct recognition
performance when the objects are occluded from 10% to 50%. The second
curve from the top in Figure 10(b) is obtained by counting a correct
indexing result when the ground truth is among the objects associated
with the highest 5 probabilities in the histogram. For the purpose of
comparison, we have also superimposed the curves in Figure 10(a) onto
Figure 10(b) as “lower/upper” bounds. Considering the correct indexing
answer in the top 5 responses, the average performance is 93.3% for 5
objects occluded from 10% to 50%. Thus, our method of integration
produces good results in comparison to the “upper bound”, which is
95.3% for 5 objects for 10% to 50% occlusion.

Figure 10: (a) “Upper” and “lower” bound of recognition rate vs.
percentage of occlusion. (b) Performance of integrated models: using
integrated models $O_1$ to $O_5$. The results for recognition (Top 1)
and indexing (Top 5) candidates are superimposed on the figure shown
in (a).
Figure 11: Integration of results by histogram-based method. (Diagram
labels: test image; final recognition results based on decreasing
confidence.)

5  Conclusions  and  Future  Work 

We  have  presented  a  novel  conceptual  approach  for 
the  recognition  of  occluded  objects  in  SAR  images. 
The  approach  uses  multiple  HMM  based  models  for 
various  observation  sequences  that  are  chosen  based 
on  the  SAR  image  formation  and  account  for  both 
the  geometry  and  magnitude  of  SAR  image  features. 
Using  99,000  training  samples  and  81,000  testing 
samples,  we  find  86.76%  average  correct  recognition 
performance  on  5  classes  of  objects  with  10%  —  50% 
occlusion.  The  number  of  observation  sequences  and 
the  number  of  features  are  design  parameters  which 
can  be  optimized  by  following  the  approach  pre¬ 
sented  in  the  paper. 

We have also done some initial experiments for articulated object
recognition using the HMM approach. We have three sets of data: the
original images for the objects (T72 tank, T80 tank, and M1A1 tank),
the images for the objects with the turret at 60 degree articulation,
and the images for the objects with the turret at 90 degree
articulation. We compared the observation sequences $O_1$ extracted
from the three sets of images. Figure 12 shows the analysis graph for
the T72 tank. Figure 12 (a1), (b1), and (c1) are obtained by counting
the number of observation symbols in the observation sequence of one
image which are the same as the corresponding symbols in the
observation sequence of another image. Figure 12 (a2), (b2), and (c2)
are obtained by summing the differences between the observation
symbols in the observation sequence of one image and the corresponding
symbols in the observation sequence of another image.

We used two of the three sets of images as training data to train the
HMM models, and tested the HMM models on the other set. Table 4 shows
the results. These experimental results are obtained by using
observation sequence $O_1$ only; the experiments using the other
sequences $O_2$ through $O_5$ will be done in the future.


Table 3: Confusion matrix (%) for 5 object classes at varying amounts
of occlusion (10%-50%). Entries marked [...] are not legible.

Object  Occlusion (%)  Fred    SCUD    T72     T80     M1A1
Fred    10             100.0   0.0     0.0     0.0     0.0
Fred    20             [...]   0.0     0.1     0.4     0.3
Fred    30             [...]   0.2     0.6     1.9     1.4
Fred    40             87.1    0.7     2.8     5.5     3.9
Fred    50             73.2    1.6     7.1     12.1    6.0
SCUD    10             0.0     100.0   0.0     0.0     0.0
SCUD    20             0.0     99.7    0.2     0.1     0.0
SCUD    30             0.9     97.3    1.2     0.4     0.3
SCUD    40             3.1     88.8    4.9     1.9     1.3
SCUD    50             5.6     77.9    11.9    2.7     1.9
T72     10             0.0     0.0     100.0   0.0     0.0
T72     20             0.4     0.2     99.2    0.1     0.2
T72     30             2.4     0.5     95.3    1.1     0.6
T72     40             9.1     2.1     82.5    3.8     2.4
T72     50             16.8    5.2     65.9    6.8     5.4
T80     10             0.0     0.0     0.0     100.0   0.0
T80     20             1.2     0.0     0.1     98.6    0.1
T80     30             6.9     0.0     0.6     91.1    1.4
T80     40             21.5    0.1     1.6     72.6    4.2
T80     50             37.4    0.8     3.1     50.9    7.8
M1A1    10             0.0     0.0     0.0     0.0     100.0
M1A1    20             1.6     0.0     0.1     0.3     98.0
M1A1    30             8.5     0.2     0.7     2.9     87.8
M1A1    40             22.5    0.8     2.0     8.5     66.1
M1A1    50             36.9    1.1     5.2     13.8    42.9
Abstract

Image  segmentation  represents  an  essential  step 
in  the  early  stages  of  an  Automatic  Target 
Recognition  system.  We  propose  two  robust  ap¬ 
proaches  fundamentally  based  on  scale  informa¬ 
tion  inherent  to  a  given  imagery.  The  first  ap¬ 
proach  is  parametric  in  that  the  scale  evolution 
of  an  image  is  statistically  captured  by  a  model 
which  is  in  turn  utilized  to  classify  the  pixels 
in  the  image.  The  second,  on  the  other  hand, 
nonlinearly  evolves  the  image  along  some  spe¬ 
cific  characteristic  to  unravel  and  delineate  the 
various  comprising  entities  in  it. 

1  Introduction 

The growing interest in Automatic Target Recognition (ATR) is
primarily due to its importance in many applications ranging from
manufacturing to remote sensing and surface surveillance. Analysis and
classification of various entities of an image are often of great
interest, and its partitioning into a set of homogeneous regions or
objects is thus of importance. The mere size of some imagery
constitutes a major hurdle. A case in point is Synthetic Aperture
Radar (SAR) imagery, for which terrain coverage rates are very high
(in excess of 1 km²/s) and whose daunting computational demands make
algorithmic efficiency of central importance.

A SAR image reflects a coherent integration of scatterer returns (i.e.
reflectivity characteristics) within a resolution cell. The number of
scatterers which coherently sum up within a cell will vary with the
resolution. This leads to a variation in the underlying statistics of
the image. Natural clutter, for example, tends to consist of a large
number of equivalued scatterers, in contrast to man-made clutter,
which mostly comprises a few prominent scatterers. It is precisely
this type of statistical characteristic that we are interested in
capturing and subsequently using as the basis for our classification
of various terrain types in SAR. In addressing such a problem, we have
two basic approaches: the first aims at characterizing the statistics
of the image evolution in scale and ultimately using them in the pixel
classification; the second attempts to diffuse “non-complying”
observations via a nonlinear evolution to make it converge to specific
desired domains of attraction.

*This report describes research supported in part by DARPA under
contract FA49620-93-1-0604.

The goal of this paper is to address a fundamental problem arising in
ATR, namely the robust and efficient segmentation of imagery into
homogeneous regions, possibly in the presence of severe noise. To
proceed, we introduce in the next section a multiscale stochastic
modeling framework which allows one to capture the evolution in scale
of a given image process and to statistically characterize regions of
interest. Two algorithms based on this framework will be described and
shown to lead to efficient and accurate segmentation. In Section 4, we
present our second method, based on a variational approach, which
consists of nonlinearly evolving a given image along some specific
geometric constraints built around an energy functional. Finally, in
Section 5, we provide a number of examples substantiating the proposed
algorithms using real data imagery and accurate estimates of
boundaries.
2 Stochastic Modeling

2.1  Multiscale  Stochastic  Models 

In  this  subsection,  we  describe  a  general  multi¬ 
scale  modeling  framework  e.g.  [2]  and  its  adap¬ 
tation  to  classification/identification  problems. 
Under  this  framework,  a  multiscale  process  is 
mapped  onto  nodes  of  a  qth  order  tree,  where 
q  depends  upon  how  the  process  progresses  in 
scale. 


As illustrated in Fig. 1, a qth order tree is a connected graph in
which each node, starting at some root node, branches off to q child
nodes. As described above, the appropriate representation for a
multiscale SAR image sequence is q = 4, a quadtree. Each level of the
tree (i.e., distance in nodes from the root node) can be viewed as a
distinct scale representation of a random process, with the
resolutions proceeding from coarse to fine as the tree is traversed
from top to bottom (root node to terminal nodes). A coarse-scale shift
operator, $\gamma$, is defined to reference the parent of node $s$,
just as the shift operator allows referencing of previous states in
discrete time series. The state elements at these nodes may be modeled
by the coarse-to-fine recursion

$$x(s) = A(s)\, x(s\gamma) + B(s)\, w(s). \qquad (1)$$


In this recursion, $A(s)$ and $B(s)$ are matrices of appropriate
dimension and the term $w(s)$ represents white driving noise. The
matrix $A(s)$ captures the deterministic progression from node
$s\gamma$ to node $s$, i.e., the part of $x(s)$ predictable from
$x(s\gamma)$, while the term $B(s)w(s)$ represents the unpredictable
component added in the progression. An attractive feature of this
framework is the efficiency it provides for signal processing
algorithms. This stems from the Markov property of the multiscale
model class, which states that, conditioned on the value of the state
at any node $s$, the processes defined on each of the distinct
subtrees extending away from node $s$ are mutually independent.

For pixel classification purposes, a multiscale model can be
constructed for each specific homogeneous class. To specify each
model, it is necessary to determine the appropriate coefficients in
the matrices $A(s)$ and $B(s)$, and the statistical properties of the
driving noise, $w(s)$. Once the models have been specified, a
likelihood ratio test can be derived to segment the imagery into the
clutter classes.

For a binary classification problem (i.e. requiring a binary
hypothesis test) each pixel in the image corresponds to one of two
hypotheses: the pixel is part of some texture ($H_g$) or another
($H_f$). By exploiting the Markov property associated with the
multiscale models for imagery, the log-likelihood ratio test for
classifying each pixel can be written as

$$\ell = \sum_s \log\left[ p_{X(s)|\beta_g}(X(s) \mid \beta_g) \right] - \sum_s \log\left[ p_{X(s)|\beta_f}(X(s) \mid \beta_f) \right], \qquad (2)$$

where $\beta_{(\cdot)} = (X(s\gamma), H_{(\cdot)})$, and
$p_{X(s)|\beta_g}$ and $p_{X(s)|\beta_f}$ are the conditional
distributions for the two hypothesized models. In the next subsection,
we will show that this likelihood test can be efficiently computed in
terms of the distributions for $w(s)$ under the two hypotheses.

2.2 Scale-Autoregressive SAR Model

In this paper, we focus on a specific class of multiscale models,
namely scale-autoregressive models [2] of the form

$$I(s) = a_1(s)\, I(s\gamma) + a_2(s)\, I(s\gamma^2) + \cdots + a_R(s)\, I(s\gamma^R) + w(s), \quad a_i(s) \in \mathbb{R} \qquad (3)$$

where $w(s)$ is white driving noise and “s” is a three-tuple vector
denoting scale (or resolution level) and spatial coordinates $(q, n)$.
For homogeneous regions of texture, the prediction coefficients (the
$a_i(s)$ in (3)) are constant with respect to image location for any
given scale. That is, the coefficients $a_1(s), \ldots, a_R(s)$ depend
only on the scale of node $s$ (denoted by $m(s)$), and thus will be
denoted by $a_{1,m(s)}, \ldots, a_{R,m(s)}$. Furthermore, the
probability distribution for $w(s)$ depends only on $m(s)$. Thus,
specifying both the scale-regression coefficients and the probability
distribution for $w(s)$ at each scale completely specifies the model.
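As an illustration of the recursion in (3), the following Python sketch predicts a pixel from its ancestors on a quadtree and forms the residual $w(s)$. It rests on an assumption that is not stated in the paper: scale $m$ is stored as a $2^m \times 2^m$ array, so that the parent of node $(m, q, n)$ is $(m-1, \lfloor q/2 \rfloor, \lfloor n/2 \rfloor)$. The names `ancestor`, `predict`, `residual`, `pyramid` and `coeffs` are hypothetical.

```python
def ancestor(m, q, n, k):
    """Return the k-th ancestor of node (m, q, n) on a quadtree.

    Assumption: scale m holds 2**m x 2**m pixels, so one step up the tree
    halves both spatial indices.
    """
    if k > m:
        raise ValueError("node has fewer than k ancestors")
    return (m - k, q >> k, n >> k)


def predict(pyramid, coeffs, m, q, n):
    """Scale-autoregressive prediction of I(s) from its R ancestors, as in (3).

    pyramid[m][q][n] is the pixel at scale m; coeffs[m] = [a_1m, ..., a_Rm]
    holds the scale-dependent regression coefficients.
    """
    total = 0.0
    for k, a in enumerate(coeffs[m], start=1):
        mk, qk, nk = ancestor(m, q, n, k)
        total += a * pyramid[mk][qk][nk]
    return total


def residual(pyramid, coeffs, m, q, n):
    """w(s): the pixel value minus its prediction from the ancestors."""
    return pyramid[m][q][n] - predict(pyramid, coeffs, m, q, n)
```

Because the coefficients depend only on the scale, one short list per scale is all that has to be stored, which is the economy the paragraph above points to.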

Following the procedure of state augmentation used in converting
autoregressive time series models to state space models, we associate to
each node $s$ an $R$-dimensional vector of pixel values, where $R$ is
the order of the regression in (3). The components of this vector
correspond to the SAR image pixel associated with node $s$ and its first
$R - 1$ ancestors. Specifically, we define

$$x(s) = [\, I(s) \;\; I(s\bar{\gamma}) \;\; \cdots \;\; I(s\bar{\gamma}^{R-1}) \,]^T. \qquad (4)$$

Thus, for a model of the form (3) or equivalently (1), $\ell$ in (2) can
be calculated using

$$p_{x(s)|x(s\bar{\gamma})}\bigl(X(s) \mid X(s\bar{\gamma})\bigr) = p_{w(s)}\bigl(W(s)\bigr), \qquad (5)$$

with $W(s)$ being the vector of residuals at scale “s”. The
identification of the model for each clutter class can thus be obtained
for each scale $m$ by a standard least-squares minimization,

$$a_m = \arg\min_{a_m \in \mathbb{R}^R} \Bigl\{ \sum_{\{s \,\mid\, m(s) = m\}} \bigl[ I(s) - a_{1,m}\,I(s\bar{\gamma}) - \cdots - a_{R,m}\,I(s\bar{\gamma}^R) \bigr]^2 \Bigr\}, \qquad (6)$$

where

$$a_m = [\, a_{1,m} \;\; a_{2,m} \;\; \cdots \;\; a_{R,m} \,],$$

with $R$ being the regression model order. For most of the results
presented in Section 5 a third-order regression ($R = 3$) for both the
grass and the forest models was chosen. To obtain a statistical
characterization of the prediction error residuals (the $w(s)$ in (3))
of the model at scale $m$, we evaluate the residuals in predicting scale
$m$ of a homogeneous test region. In particular, we use the $a_{m,\mu}$
found in (6) to evaluate all $\{w(s) \mid m(s) = m\}$ as indicated by
Eq. (5), with $\mu$ specifying “grass” or “forest”.
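The least-squares identification in (6) can be sketched in plain Python by forming and solving the normal equations at one scale. This is a minimal sketch under stated assumptions, not the authors' implementation: the pyramid layout ($2^m \times 2^m$ pixels at scale $m$, parent indices obtained by halving) is assumed, the scale must satisfy $m \ge R$, and the names `solve`, `fit_scale` and `residuals` are hypothetical.

```python
def solve(G, b):
    """Solve G x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(G, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - acc) / M[r][r]
    return x


def ancestors_of(pyramid, m, q, n, R):
    """Pixel values of the first R ancestors of node (m, q, n)."""
    return [pyramid[m - k][q >> k][n >> k] for k in range(1, R + 1)]


def fit_scale(pyramid, m, R):
    """Least-squares estimate of [a_1m, ..., a_Rm] at scale m, as in (6)."""
    G = [[0.0] * R for _ in range(R)]
    b = [0.0] * R
    size = len(pyramid[m])
    for q in range(size):
        for n in range(size):
            phi = ancestors_of(pyramid, m, q, n, R)
            target = pyramid[m][q][n]
            for i in range(R):
                b[i] += phi[i] * target
                for j in range(R):
                    G[i][j] += phi[i] * phi[j]
    return solve(G, b)


def residuals(pyramid, m, a_m):
    """Prediction-error residuals w(s) for every node at scale m."""
    R = len(a_m)
    size = len(pyramid[m])
    out = []
    for q in range(size):
        for n in range(size):
            phi = ancestors_of(pyramid, m, q, n, R)
            out.append(pyramid[m][q][n] - sum(a * p for a, p in zip(a_m, phi)))
    return out
```

Running `fit_scale` on a homogeneous training region of each terrain type, then `residuals` with the fitted coefficients, gives the samples from which a residual distribution per scale could be estimated.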


3  Model-Based  Classification 
3.1  Residual-Based  Classification 


While we could conceivably postulate a spatial random field model for
each windowed clutter category to classify its center pixel, we use to
advantage the efficiency of multiscale likelihood calculation to base
the classification of each individual pixel “s” on a surrounding
$(2K + 1) \times (2K + 1)$ window $\mathcal{W}(s)$, where the parameter
$K$ is a nonnegative and judiciously selected integer. While a larger
window provides a more accurate classification of homogeneous regions,
it also increases the likelihood that the window contains a clutter
boundary. Thus, keeping the window size as small as possible is also
desirable. As shown in [1] in the next section, one can determine the
tradeoff between classification accuracy and window size by examining
the empirical distribution of $\ell$ over windows of various sizes for
homogeneous regions of grass and forest.


3.1.1  Statistical  Hypotheses  Test 


Whenever a clutter boundary is present within a test window, the
validity of the center pixel classification is questionable. This effect
results in a classification bias near boundaries. To address this
problem, we devise a method to detect the proximity of grass-forest
boundaries and subsequently utilize a procedure to refine the
classification. Terrain boundary proximity is detected via a simple
modification of the decision made based on the test statistic $\ell$.
Specifically, rather than comparing $\ell$ to a single threshold to
decide on a grass-or-forest classification, we compare $\ell$ to two
thresholds $a$ and $b$ as shown in Fig. 2, which includes a defer
decision.


    $\ell \ge a$:      Classify as Grass,
    $a > \ell > b$:    Defer decision (possible boundary presence),
    $\ell \le b$:      Classify as Forest.

Fig. 2: Initial pixel classification.


This is tantamount to refining the decision procedure by considering
smaller windows within the original window until a majority rule lifts
the ambiguity.
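The two-threshold rule of Fig. 2 can be sketched as follows. The window statistic here assumes zero-mean Gaussian residuals with a known standard deviation per class; that distributional choice is an assumption made only for this illustration, as are the names `gaussian_logpdf`, `window_statistic` and `initial_classification`.

```python
import math


def gaussian_logpdf(w, sigma):
    """Log-density of a zero-mean Gaussian residual (assumed model)."""
    return -0.5 * math.log(2.0 * math.pi * sigma * sigma) - (w * w) / (2.0 * sigma * sigma)


def window_statistic(res_grass, res_forest, sigma_g, sigma_f):
    """Log-likelihood ratio over a window.

    res_grass / res_forest are the residuals of the same window pixels
    under the grass and forest models respectively.
    """
    return sum(gaussian_logpdf(wg, sigma_g) - gaussian_logpdf(wf, sigma_f)
               for wg, wf in zip(res_grass, res_forest))


def initial_classification(ell, a, b):
    """Compare the statistic to two thresholds a > b, allowing a deferral."""
    if a <= b:
        raise ValueError("expected thresholds with a > b")
    if ell >= a:
        return "grass"
    if ell <= b:
        return "forest"
    return "defer"  # possible boundary inside the window
```

A deferred pixel would then be re-examined with smaller windows, as described above; that refinement loop is left out of the sketch.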


3.2  Parameter-Based  Classification 

An  alternative  approach  to  classifying  a  clut¬ 
ter  type  is  to  evaluate  an  f'  distance  between 
the computed model parameters for a window region $\mathcal{W}(s)$ and
those of a template homogeneous region [3]. For each pixel “s” (or
equivalently tree node) a model for a corresponding surrounding window
$\mathcal{W}(s)$ is computed at several scales, giving rise to what we
refer to as an evolution vector statistic,

$$y(s) = [\, a_{m(s)}, \ldots, a_1 \,], \qquad (7)$$

where “s” will sweep all of the three-tuple vectors $[m(s), q, n]$ at
the $m(s)$th resolution with $m(s) = 1, \ldots, L-1$ and $L$ the
cardinality of the considered set of images. We should note at this
point that in this approach, the DC component in the $a_{(\cdot)}$ is
included. The order of the regression associated with modeling
$\mathcal{W}([m(s), q, n])$ from its ancestors will vary with the level
$m(s)$ as defined by the function

Om(s)  =  max(R,m(s)) 

The regression vector $a_{m(s)}$ provides a statistically optimal
description of the linear dependency of $\mathcal{W}(s)$ on
$\{\mathcal{W}(s\bar{\gamma}^{j})\}_{j=1}^{O_{m(s)}}$, with $y(s)$ thus
being a measure of the scale evolution behavior of the windowed region.

Here again the size of the window used to compute the evolution vector
is the result of a tradeoff between modeling consistency and local
accuracy.

3.2.1 Statistical Classification

A characterization of the evolution vector $y(s)$ is necessary to carry
out a statistically meaningful classification of various types of
terrain. Specifically, a binary hypothesis test (BHT) is applied to the
evolution vector $y(s)$ in order to classify node $s$ as a member of
either a region of grass or of forest. These hypotheses are respectively
designated as hypotheses $H_g$ and $H_f$. The classification of pixel
$I(s)$ will depend only on $y(s)$ and the predetermined likelihoods
$p(y(s) \mid H_g)$ and $p(y(s) \mid H_f)$.

To carry out a statistically significant hypothesis test, we need to
specify the conditional probability density for $y$ under each
hypothesis. To do this, we extensively examined the distribution of the
evolution vectors obtained from a large homogeneous region of the
corresponding terrain. For both grass and forest terrain, it turns out
that each component in $y$ approximately has a Gaussian distribution. We
consequently make the approximation that $p(y \mid H_g)$ and
$p(y \mid H_f)$ are $N$-variate Gaussian densities. They are then
completely specified by their mean vectors $m_g$ and $m_f$ and their
covariance matrices $K_g$ and $K_f$, all of which are calculated from
the training data for each hypothesis. A maximum likelihood (ML)
detector is then used to classify each pixel (using the ML detector
assumes equal a priori probabilities for each hypothesis and a cost
function that penalizes all misclassifications equally). In implementing
the ML detector, a threshold $\eta$ is calculated from the likelihoods
and used in the classification of each pixel through its comparison to a
sufficient statistic derived from the evolution vector. Because
$p(y \mid H_g)$ and $p(y \mid H_f)$ are assumed to be Gaussian, it is
straightforward to compute the threshold $\eta$ and the sufficient
statistic $\ell(y)$ for each evolution vector as

$$\eta = \ln \frac{|K_g|}{|K_f|} \qquad (8)$$

and

$$\ell(y) = (y - m_f)^T K_f^{-1} (y - m_f) - (y - m_g)^T K_g^{-1} (y - m_g). \qquad (9)$$

The classification of $I(s)$, denoted as $C(s)$, is then given by

$$C(s) = \begin{cases} H_g & \text{if } \eta < \ell(y[m, q, n]), \\ H_f & \text{if } \eta > \ell(y[m, q, n]). \end{cases} \qquad (10)$$

The construction of the evolution vector $y$ and subsequent application
of a BHT for each $s \in \{s \mid m(s) = m\}$ thus provide a
segmentation of $I_m$. Instead of independently generating a
segmentation for each image resolution, for all
$s \in \{s \mid m(s) < L\}$, $C(s)$ can be obtained by comparing $\eta$
to the average of the sufficient statistics of nodes in $I_L$ which have
$s$ as an ancestor, i.e.

$$C(s) = \begin{cases} H_g & \text{if } (2^{L-m(s)-1})^2\,\eta < \sum_{s'} \ell(y(s')), \\ H_f & \text{if } (2^{L-m(s)-1})^2\,\eta > \sum_{s'} \ell(y(s')), \end{cases} \qquad (11)$$

with the sums taken over those descendant nodes $s'$.


Although  this  does  not  yield  a  sufficient  statis¬ 
tic  for  I(s),  doing  so  is  computationally  more 
efficient  than  calculating  a  sufficient  statistic  as 
in  Eq.  (9)  for  each  node  at  every  level  in  the 
quadtree.  Note  that  the  segmentation  technique 
described  here,  could  easily  be  generalized  to  a 
larger  number  of  terrain  types. 
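The Gaussian ML test of Section 3.2.1 can be sketched in plain Python. Setting the two Gaussian log-densities equal gives a threshold $\eta = \ln(|K_g| / |K_f|)$ for the statistic in (9); the code below uses that form. It is a sketch under stated assumptions: the covariance matrices are symmetric positive definite lists of lists, and the names `solve_and_logdet`, `mahalanobis` and `classify` are hypothetical.

```python
import math


def solve_and_logdet(K, v):
    """Return (K^{-1} v, ln|K|) for a symmetric positive-definite K.

    Gaussian elimination without pivoting is adequate here because the
    pivots of a positive-definite matrix are all positive.
    """
    n = len(v)
    M = [row[:] + [x] for row, x in zip(K, v)]
    logdet = 0.0
    for c in range(n):
        if M[c][c] <= 0.0:
            raise ValueError("K must be positive definite")
        logdet += math.log(M[c][c])
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x, logdet


def mahalanobis(y, mean, K):
    """Quadratic form (y - mean)^T K^{-1} (y - mean), plus ln|K|."""
    d = [yi - mi for yi, mi in zip(y, mean)]
    z, logdet = solve_and_logdet(K, d)
    return sum(di * zi for di, zi in zip(d, z)), logdet


def classify(y, m_g, K_g, m_f, K_f):
    """ML decision between grass and forest for one evolution vector y."""
    d_g, logdet_g = mahalanobis(y, m_g, K_g)
    d_f, logdet_f = mahalanobis(y, m_f, K_f)
    stat = d_f - d_g            # sufficient statistic, as in (9)
    eta = logdet_g - logdet_f   # ML threshold ln(|K_g| / |K_f|)
    label = "grass" if stat > eta else "forest"
    return label, stat, eta
```

In practice the means and covariances would be estimated once from training regions, so that only the two quadratic forms are evaluated per pixel.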

4 Stabilized Inverse Diffusion Equations (SIDEs)

The  previous  two  techniques  rely  on  modeling  a 
linear  evolution  of  the  observed  imagery  with  an 
ultimate  goal  of  pixel  classification.  In  this  sec¬ 
tion,  we  instead  carry  out  a  nonlinear  evolution 
which  is  driven  by  prescribed  geometric  features 
underlying  the  process/imagery,  and  study  its 
progression. 

Towards  that  end,  we  introduce  a  discontinuous 
force  function,  resulting  in  a  system  of  equations 
that  has  discontinuous  right-hand  side  (RHS). 
As  shown  below,  the  objective  is  to  drive  an 
evolution  trajectory  onto  a  lower-dimensional 
surface  which  clearly  has  value  in  image  anal¬ 
ysis,  and  in  particular  in  image  segmentation. 
Segmenting a signal or image, represented as a high-dimensional vector
$I$, consists of evolving it so that it is driven onto a comparatively
low-dimensional subspace, which corresponds to a segmentation of the
signal or image domain into a small number of regions.


Fig. 3: Force function for a SIDE (left panel: Perona-Malik; right
panel: SIDE).

The type of force function of interest to us here is illustrated in
Fig. 3 (right). More precisely, we wish to consider force functions
$F(v)$ which, in addition to driving the following evolutions,

$$\dot{I}(s) = \mathcal{F}(I)(s), \qquad I(0) = I_0, \qquad (12)$$

where “s” is now a continuous scale and a “dot” denotes differentiation
with respect to it, satisfy the following properties:

$$F'(v) \le 0 \ \text{for } v \neq 0, \quad F(0^+) > 0, \quad F(v_1) = F(v_2) \Leftrightarrow v_1 = v_2. \qquad (13)$$

Contrasting  this  form  of  a  force  function  to  the 
Perona-Malik  function  [4]  in  Fig.  3  (left),  we  see 
that  in  a  sense,  one  can  view  the  discontinuous 
force function as a limiting form of the continuous force function in
Fig. 3 (left). We, however, need a special definition of how the
trajectory of our evolution proceeds at the discontinuity points
$F(0^+) \neq F(0^-)$. For this definition to be useful, the resulting
evolution must satisfy well-posedness properties: the existence and
uniqueness of solutions, as well as stability of solutions with respect
to the initial data. We address the issue of well-posedness and other
properties in [5].

Considering the evolution (12) with $F(v)$ as in Fig. 3 (right) in a
SIDE, notice that the RHS of (12) has a discontinuity at a point $I$ if
and only if $I_i = I_{i+1}$ for some $i$ between 1 and $N - 1$. It is
when a trajectory reaches such a point that we need the following
definition:

$$\dot{I}_i = \dot{I}_{i+1} = \frac{1}{2}\bigl( F(I_{i+2} - I_{i+1}) - F(I_i - I_{i-1}) \bigr). \qquad (14)$$

In other words, the two observations are simply merged into a single
one, resulting in Eq. (14) for $n = i$ and $n = i + 1$ (the differential
equations for $n \neq i, i + 1$ do not change).


Similarly, if $q$ consecutive observations become equal, they are merged
into one which is weighted by $1/q$ [5]. The evolution can then
naturally be thought of as a sequence of stages: during each stage, the
right-hand side of (12) is continuous. Once the solution hits a
discontinuity surface of the right-hand side, the state reduction and
re-assignment described above take place. The solution then proceeds
according to the modified equation until the next discontinuity surface,
etc.

Notice  that  such  an  evolution  automatically 
produces  a  multiscale  segmentation  of  the  origi¬ 
nal  signal  if  we  view  each  compound  observation 
as a region of the signal. The algorithm may be summarized as follows:

1.  Start  with  the  trivial  initial  segmentation: 
each  sample  is  a  distinct  region. 

2.  Evolve  (12)  until  the  values  in  two  or  more 
neighboring  regions  become  equal. 

3.  Merge  the  neighboring  regions  whose  val¬ 
ues  are  equal. 

4.  Go  to  step  2. 
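The four steps above can be sketched for a 1-D signal as follows. This is a numerical illustration under stated assumptions, not the method of [5]: the evolution is integrated with a fixed-step forward Euler scheme, two neighboring regions are declared equal when their difference vanishes or changes sign within a step, the force function is taken to be $F(v) = \mathrm{sgn}(v) - v/L$ with $L$ chosen larger than any difference in the signal, and each region of $q$ samples is weighted by $1/q$. The names `force`, `side_segment` and `_merge` are hypothetical, and the signal is assumed non-empty.

```python
def force(v, L):
    """Discontinuous force: jumps at v = 0 and decreases elsewhere (|v| < L)."""
    if v > 0:
        return 1.0 - v / L
    if v < 0:
        return -1.0 - v / L
    return 0.0


def _merge(old, new, sizes):
    """Merge neighbours whose difference vanished or changed sign in a step."""
    out_v, out_s = [new[0]], [sizes[0]]
    for i in range(1, len(new)):
        crossed = (old[i] - old[i - 1]) * (new[i] - new[i - 1]) <= 0.0
        if crossed:
            total = out_s[-1] + sizes[i]
            # size-weighted mean keeps the total "mass" of the signal
            out_v[-1] = (out_v[-1] * out_s[-1] + new[i] * sizes[i]) / total
            out_s[-1] = total
        else:
            out_v.append(new[i])
            out_s.append(sizes[i])
    return out_v, out_s


def side_segment(signal, n_regions, dt=1e-3, max_steps=1_000_000):
    """Evolve and merge until at most n_regions regions remain.

    Returns (values, sizes): one value and one sample count per region.
    """
    values = [float(x) for x in signal]
    sizes = [1] * len(values)
    L = 2.0 * (max(values) - min(values)) + 1.0
    # Step 1: trivial segmentation, merging samples that start out equal.
    values, sizes = _merge(values, values, sizes)
    steps = 0
    while len(values) > n_regions and steps < max_steps:
        k = len(values)
        new = []
        for i in range(k):
            right = force(values[i + 1] - values[i], L) if i + 1 < k else 0.0
            left = force(values[i] - values[i - 1], L) if i > 0 else 0.0
            # Step 2: evolve; a region of q samples moves with weight 1/q.
            new.append(values[i] + dt * (right - left) / sizes[i])
        # Steps 3 and 4: merge regions that have met, then continue.
        values, sizes = _merge(values, new, sizes)
        steps += 1
    return values, sizes
```

Recording `sizes` each time a merge occurs would give the full multiscale family of segmentations rather than only the final one.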

The  same  algorithm  can  be  used  for  2-D  images, 
which  is  immediate  upon  re-writing  Eq.  (12) 
and  an  example  is  provided  next. 

5  Experiments 

A number of segmentation experiments have been carried out on SAR
imagery using the three techniques described above. Fig. 4(b) shows a
segmentation resulting from a systematic pixel classification as
described in Technique 1. Both a brute-force and a refined segmentation
are shown, and the hand-drawn, eyeballed boundary is shown in white. In
Fig. 4(c), the model-based segmentation is illustrated, and in Fig. 5
the nonlinear evolution-based segmentation is shown, which is perhaps
the most promising, albeit at a slightly higher computational cost.


Fig.  4:  Residual  and  Model-Based  Segmentation 
(b  and  c). 


Fig.  5:  Segmentation  by  Nonlinear  Diffusion. 

References 

[1] C. Fosgate, H. Krim, W. Irving, W. Karl, and A. Willsky. Multiscale
segmentation and anomaly enhancement of SAR imagery. IEEE Trans. on Im.
Proc., 6(1):7-20, Jan. 96.

[2] W.W. Irving. Multiresolution approach to discriminating targets from
clutter in SAR imagery. In Proc. of SPIE Symp., Orlando, FL, 1995.

[3] A. Kim and H. Krim. Hierarchical stochastic modeling of SAR imagery
for segmentation/compression. Submitted to special issue of IEEE Trans.
on SP, 1996.

[4] P. Perona and J. Malik. Scale-space and edge detection using
anisotropic diffusion. IEEE Trans. Pattern Analysis and Machine
Intelligence, 12, 1990.

[5] I. Pollak, A.S. Willsky, and H. Krim. Image segmentation and edge
enhancement with stabilized inverse diffusion equations. To be submitted
to IEEE Trans. on Image Processing.




Invariants  for  the  Recognition  of 
Articulated  and  Occluded  Objects  in  SAR  Images 


Grinnell  Jones  III  and  Bir  Bhanu* 

College  of  Engineering,  University  of  California,  Riverside,  CA  92521 
{grinnell, bhanu}@constitution.ucr.edu
URL:  http://constitution.ucr.edu 


Abstract 

Articulation  invariant  features  are  found  (and 
quantified)  in  Synthetic  Aperture  Radar  (SAR)  im¬ 
ages  of  military  vehicles.  They  are  used  in  the  devel¬ 
opment  of  a  SAR  recognition  engine  that  successfully 
identified  articulated  objects  based  on  non-articulated 
recognition  models.  The  engine  also  achieves  robust 
recognition  performance  with  mostly  spurious  data 
from  noise  or  highly  occluded  objects.  Performance 
results  are  related  to  the  percent  of  invariant  or  unoc¬ 
cluded  features. 

1  Introduction 

1.1  Problem  Definition  and  Scope 

Automated  object  recognition  in  SAR  imagery  is 
a  significant  problem  because  recent  developments  in 
image  collection  platforms  will  soon  produce  far  more 
imagery  (terabytes  per  day  per  aircraft)  than  the  de¬ 
clining  ranks  of  2000  image  analysts  are  capable  of 
handling  [10].  The  specific  challenges  of  this  research 
are  to  address  the  need  for  automated  recognition  of 
military vehicles that can be in articulated configurations (such as
tanks, where the turret can rotate, and the SCUD missile launcher, where
the missile can
erect)  and  can  be  partially  hidden.  Previous  recog¬ 
nition  methods  involving  template  matching  [11]  are 
not  useful  in  these  cases,  because  articulation  or  oc¬ 
clusion  changes  global  features  like  the  object  outline 
and  major  axis.  In  this  paper  the  problem  scope  is  the 
recognition  subsystem  itself,  starting  with  SAR  chips 
of  target  vehicles  and  ending  with  the  vehicle  identifi¬ 
cation.  Because  the  very  high  resolution  SAR  target 
chips  are  not  openly  available,  the  U.S.  Air  Force  pro¬ 
vided  the  XPATCH  high  frequency  radar  signature 
prediction  code  [1],  which  is  used  to  construct  4320 
target  chips  for  this  research. 

1.2  Overview  of  Approach 

Our approach to object identification is specifically designed for SAR.
The peaks (local maxima) in radar return are related to the physical
geometry of the object. The relative locations of these scattering
centers are independent of translation and serve as distinguishing
features. The specular radar return varies greatly with the uncontrolled
target orientation (azimuth). This azimuthal variance is captured by
using 360 azimuth models. (The radar depression angle to the target is
controllable, or known, and it is fixed at 15° for this study.) Useful
articulation invariants are found, which permit building non-articulated
recognition models and using them to successfully recognize articulated
targets. The SAR recognition engine, Figure 1, has an off-line model
construction phase and an on-line recognition process. The recognition
model is a look-up table that relates the relative distances among the
scattering centers (in the radar range and cross range directions) to
object type and azimuth. The recognition process is an efficient search
for positive evidence, using relative locations of scattering centers to
access the look-up table and generate votes for the appropriate object
(and azimuth).

Figure 1: SAR recognition engine architecture.

*Supported in part by grants MDA972-93-1-0010 and DAAH04-95-1-0488; the
contents and information do not reflect positions or policies of the
U.S. Government.

1.3  Related  Work  and  Our  Contribution 

Table 1: Related work comparisons.

Area                 Related Approach             This Approach
-------------------  ---------------------------  ---------------------
Indexing:            Geometric Hashing [8]:       SAR specific:
  • transformation   translation                  translation
                     scale                        scale fixed in SAR
                     rotation                     azimuth models
  • bin type/size    real/varies                  integer/fixed
SAR Recognition:     Template Matching [11]:      Peak Locations:
  • azimuth models   12 blobs                     360 constellations
  • reference frame  global                       local
Occlusion:           Invariant Histogram [7]:     Recognition Engine:
  • azimuth models   36 (10° apart)               360 (1° apart)
  • bin size         2-4 feet                     6 inches
  • near neighbor?   yes                          no
  • search           exhaustive (Lj norm)         voting
Articulation         Constraint Models [2] [6]    Invariants

A comparison of this approach, for the articulated and occluded object
recognition problems in SAR, with related work is given in Table 1. Our
approach is designed specifically for SAR, but is related to geometric
hashing [8]. Scattering center relative positions are used as SAR
recognition features. Template matching [11] is not suitable for
recognizing articulated and occluded objects since there will be a
combinatorial explosion of the number of templates with varying
articulations and occlusions. The SAR recognition engine presented here
has sufficient precision to perform both indexing and matching
functions, while the invariant histogram technique (that has been
applied to recognize occluded objects [7]) has limited performance, is
only capable of indexing object models and requires a separate template
matching step. Constrained models of parts and joint articulation, used
in optical images [2] [6], are not appropriate for the relatively low
resolution, non-literal nature and complex part interactions of SAR
images, which are handled by using articulation invariants as discussed
in this paper. The major contributions of this paper are:

1.  Identifies  and  quantifies  articulation  invariants. 

2.  Demonstrates  a  SAR  recognition  engine  with  ro¬ 
bust  performance  for  articulated  and  occluded  ob¬ 
jects. 

3.  Relates  performance  with  invariance  of  features. 

4.  Quantifies  azimuthal  variance. 

2  SAR  Scattering  Centers 

The relative locations of peaks in the radar return are characteristic
features that are related to the geometry of the object. The typical
detailed edge and straight line features of man-made objects in the
visual world do not have good counterparts in SAR images for
sub-components of vehicle sized objects at six inch to two foot
resolution. The amplitude map, Figure 2, of a typical SAR target (SCUD
launcher with missile erect, 18° azimuth, 15° depression) at six inch
resolution shows a wealth of peaks corresponding to scattering centers
and has no obvious lines or edges within the boundary of the vehicle.
The 4320 target chips for the T72, T80, M1a1, FRED tanks and the SCUD
missile launcher have a range of 52 to 284 local peaks. The locations of
the peaks are related to the aspect and detailed geometry of the object.
For example, for the T72 tank model, the strongest returns (that persist
for 20° or more in azimuth span) are from four trihedral corners on the
upper rear deck of the tank hull [3]. Figure 3 shows target geometries
(model sizes in increasing order: T80, M1a1, T72 and SCUD launcher). The
tank turret angle is measured counter-clockwise from the hull forward
centerline.

Figure 2: Example SAR image amplitude map.

2.1  Azimuthal  Variance 

The typical rigid body rotational transformations for viewing objects in
the visual world do not apply well to the specular radar reflections of
SAR images, because significant numbers of features do not typically
persist over a few degrees of rotation. Averaging the results for 360
azimuths of the T72 tank, only about one-third of the 50 strongest
scattering center locations (in object centered coordinates) remain
unchanged (i.e. within an error radius of 1/2 pixel) for a 1° azimuth
change (see Figure 4) and less than 5% persist for 10°. These are
significantly less than the one foot resolution ISAR results of Dudgeon
[5], whose definition of ‘persistence’ allowed scintillation (i.e., a
point was required to be present/absent for 2 consecutive angles, 1°
apart, to appear/disappear; thus a feature point would be ‘persistent’
if it appears and then disappears in images separated by 1°). Because of
the presence of azimuthal variation and the goal of recognizing
articulated and occluded objects, in this research we use 360 azimuth
models (1° intervals), in contrast to others who have used 6° [9] and
10° [7] intervals and 12 models [11].

Figure 3: Articulated objects (not to scale): (a) T72 turret 60°,
(b) T80 turret 60°, (c) M1a1 turret 90°, (d) SCUD missile launcher.

Figure 4: T72 azimuthal invariance (plot title: top 50 scattering
centers for T72 tank, 360 azimuths).

Figure 5: SAR image resolution examples: (a) two foot, (b) six inch.

2.2  Image  Resolution 

A  SAR  image  is  formed  by  collecting  backscattered 
field  over  a  frequency  band  and  over  an  angular  span 
of  incident  directions.  The  resolution  and  scale  of 
objects  are  fixed  by  the  operating  parameters  of  the 
radar  beam:  frequency,  frequency  bandwidth  and  an¬ 
gular  span.  Six  inch  resolution  X-band  images  (10.0 
GHz center frequency, 1.0 GHz bandwidth, 5.6° angular
span) are used to provide a rich feature set to
facilitate  the  task  of  recognizing  articulated  and  oc¬ 
cluded  targets.  This  provides  16  times  as  many  pixels 
as  two  foot  resolution.  The  comparison  of  the  two 
foot  resolution  target  ‘blobs’  with  the  six  inch  resolu¬ 
tion  constellation  of  image  points  is  shown  in  Figure 
5  for  the  FRED  tank. 
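The quoted radar parameters can be checked against the stated six inch resolution with the standard small-angle SAR relations, range resolution $c / (2B)$ and cross-range resolution $\lambda / (2\,\Delta\theta)$. These relations are textbook formulas brought in for this illustration and are not stated in the paper; the function names are hypothetical.

```python
import math

C = 299_792_458.0   # speed of light in m/s
INCH = 0.0254       # metres per inch


def range_resolution(bandwidth_hz):
    """Range resolution c / (2 B) in metres."""
    return C / (2.0 * bandwidth_hz)


def cross_range_resolution(center_freq_hz, angular_span_deg):
    """Cross-range resolution lambda / (2 * span) in metres (small angles)."""
    wavelength = C / center_freq_hz
    return wavelength / (2.0 * math.radians(angular_span_deg))


def pixel_ratio(coarse_resolution, fine_resolution):
    """How many fine pixels cover one coarse pixel in two dimensions."""
    return (coarse_resolution / fine_resolution) ** 2


range_in = range_resolution(1.0e9) / INCH
cross_in = cross_range_resolution(10.0e9, 5.6) / INCH
print(f"range: {range_in:.2f} in, cross-range: {cross_in:.2f} in")
print(f"pixels per two-foot cell at six inches: {pixel_ratio(24.0, 6.0):.0f}")
```

Both resolutions come out close to six inches, and the factor of 16 follows from squaring the ratio of two feet to six inches.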


(a)  missile  down 


(b)  missile  up 


Figure  6:  SCUD  launcher  articulation  example. 


(a)  straight  turret  (b)  —60°  turret 


Figure  7:  T72  articulation  example. 

3  Articulation  Invariants 

The  existence  of  articulation  invariants  in  six  inch 
resolution  SAR  imagery  can  be  seen  in  Figures  6  and  7, 
where  the  locations  of  scattering  centers  are  indicated 
by  the  black  squares.  In  the  example  of  the  SCUD 
launcher,  with  the  radar  directed  (from  the  left  in  Fig¬ 
ure  6)  at  the  front  (cab  end)  of  the  launcher,  many  of 
the  details  from  the  launcher  itself  are  independent 
of  whether  the  missile  is  erect  or  down.  In  the  similar 
view  of  the  T72  tank,  many  of  the  details  from  the  hull 
are  independent  of  the  turret  angle.  An  example  of  ar¬ 
ticulation  invariance  is  shown  in  Figure  8,  which  plots 
the  percentage  of  the  strongest  50  scattering  centers 
for  the  T72  tank  that  are  in  exactly  the  same  location 
with  the  turret  rotated  60°  as  they  are  with  the  turret 
straight forward, for each of 360 azimuths. The mean $\mu$, standard
deviation $\sigma$ and $\mu - \sigma$ values of the average percent
articulation invariance (for 360 azimuths) are shown in Table 2 for the
individual articulated objects and the overall average. Comparing the
cases for 25 and 50 scattering centers, the mean values are similar, but
the $\mu - \sigma$ values are consistently less for the 25 scatterer
cases. The smaller average articulation invariance for the M1a1 tank is
expected, because the M1 tank has a comparatively much larger turret
than the other tanks (see Figure 3).
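The invariance measure described above (the percentage of the strongest scattering centers found at exactly the same location in both configurations) and its summary statistics can be sketched as follows. The function names are hypothetical, peaks are assumed to be given as integer (range, cross-range) pixel pairs, and the standard deviation is taken over the population of azimuths, which is an assumption about how Table 2 was computed.

```python
def percent_invariant(ref_peaks, art_peaks):
    """Percent of reference scattering centers present at the same pixel.

    ref_peaks: peaks of the non-articulated view (e.g. turret straight);
    art_peaks: peaks of the articulated view at the same azimuth.
    """
    ref = set(ref_peaks)
    if not ref:
        raise ValueError("no reference scattering centers")
    return 100.0 * len(ref & set(art_peaks)) / len(ref)


def summarize(percentages):
    """Return (mu, sigma, mu - sigma) over a list of per-azimuth values."""
    count = len(percentages)
    mu = sum(percentages) / count
    sigma = (sum((p - mu) ** 2 for p in percentages) / count) ** 0.5
    return mu, sigma, mu - sigma
```

Calling `percent_invariant` once per azimuth and passing the 360 results to `summarize` would reproduce the kind of figures the text attributes to Table 2.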

4  SAR  Recognition  Engine 

4.1  Local  Reference  Coordinate  System 
and  Translational  Invariance 

Establishing  an  appropriate  local  coordinate  refer¬ 
ence  frame  is  critical  to  reliably  identifying  objects 
(based  on  locations  of  features)  in  SAR  images  of  ar- 
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Invariance  (percent) 


T72  tank  with  60  degree  turret  (50  scattering  centers) 


Figure  8:  Articulation  invariants  example. 


Table  2:  Articulation  invariance  percentages. 
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ticulated  and  occluded  objects.  The  object  articula¬ 
tion  and  occlusion  problems  require  the  use  of  a  local 
coordinate  system;  global  coordinates  and  global  con¬ 
straints  do  not  work,  as  illustrated  in  Figures  6  and  7 
where  the  center  of  mass  and  the  principal  axes  change 
with  articulation.  In  the  geometry  of  a  SAR  sensor  the 
‘squint  angle’,  the  angle  between  the  flight  path  (cross- 
range  direction)  and  the  radar  beam  (range  direction), 
can  be  known  and  flxed  at  90'’ .  Given  the  SAR  squint 
angle,  the  image  range  and  cross-range  directions  are 
known  and  any  local  reference  point  chosen,  such  as  a 
scattering  center  location,  establishes  a  reference  coor¬ 
dinate  system.  (The  scattering  centers  are  local  max¬ 
ima  in  the  radar  return  signal.)  The  relative  distance 
and  direction  of  the  other  scattering  centers  can  be 
expressed  in  radar  range  and  cross-range  coordinates, 
and  naturally  tessellated  into  integer  buckets  that  cor¬ 
respond  to  the  radar  range/cross-range  bins.  For  the 
examples  shown  in  Figures  6,  7  and  10  -  13  range  is 
to  the  right  (a;  axis),  cross-range  is  up  {y  axis).  The 
recognition  engine  takes  advantage  of  this  natural  sys¬ 
tem  for  SAR,  where  a  single  basis  point  performs  the 
translational  transformation  and  fixes  the  coordinate 
system  to  a  ‘local’  origin.  For  ideal  data,  picking  the 
location  of  the  strongest  scattering  center  as  the  refer¬ 
ence  point  is  sufficient.  However,  for  potentially  cor¬ 
rupted  data  where  any  feature  point  could  be  spurious 
or  missing  (due  to  the  effects  of  noise,  articulation, 
occlusion,  non-standard  configurations,  etc.),  the  pro¬ 
cess  needs  to  continue  with  other  scattering  centers  as 
the  reference  point  to  ensure  a  valid  feature  point  is 


obtained  as  the  origin.  Although  this  implementation 
used  all  the  scattering  center  locations  in  turn  as  ref¬ 
erence  points,  heuristics  could  be  applied  to  use  fewer 
reference  points  for  increased  efficiency. 
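
The local coordinate idea above can be illustrated with a minimal sketch. It assumes scattering centers are already available as integer (range bin, cross-range bin) pairs; the names `relative_offsets` and `all_local_frames` are hypothetical and are not taken from the original implementation.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (range bin, cross-range bin) of one scattering center


def relative_offsets(centers: List[Point], origin_index: int) -> List[Point]:
    """Express every other scattering center relative to the chosen origin."""
    r0, c0 = centers[origin_index]
    return [(r - r0, c - c0)
            for i, (r, c) in enumerate(centers)
            if i != origin_index]


def all_local_frames(centers: List[Point]) -> List[List[Point]]:
    """Use each scattering center in turn as the reference point."""
    return [relative_offsets(centers, i) for i in range(len(centers))]


centers = [(10, 5), (12, 9), (7, 5)]
shifted = [(r + 100, c - 3) for r, c in centers]  # same target, translated chip
print(relative_offsets(centers, 0))
print(relative_offsets(centers, 0) == relative_offsets(shifted, 0))
```

Because only differences of bin indices are used, the offsets do not change when the whole target is translated within the chip, which is the property the single basis point is meant to provide.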

4.2  Recognition  System  Architecture 

The basic system architecture, Figure 1, is an off-line
model construction process and a similar on-line
recognition process. The approach is based on local
features and local reference coordinate systems. A
systematic method is employed for constructing recognition
models of objects that are not articulated; the
models are stored in a look-up table, and local image
features are then used to index into the look-up table of
models and recognize the same objects in articulated
configurations. The image features used are the positions
of the scattering centers (local maxima in the
signal  strength).  The  number  of  scattering  centers 
used  is  a  design  parameter  that  is  optimized  based 
on  experimental  results.  The  positions  of  the  scat¬ 
tering  centers  are  expressed  in  relative  distances  in 
the  known  SAR  range  and  cross-range  coordinates. 
The  model  construction  technique  extracts  these  rel¬ 
ative  distances  of  the  scattering  centers  from  the  non- 
articulated  training  data  for  all  360  azimuths  for  each 
target  type.  An  example  relative  distance  distribution 
for  the  SCUD  launcher  with  the  missile  up  is  shown 
in  Figure  9  (with  the  distances  shifted  by  154  range 
and  99  cross  range  to  make  the  values  non-negative) . 
The  model  database  is  basically  a  table  that  relates 
these  distances  to  object  labels  (target  type  and  az¬ 
imuth).  The  bounds  of  the  table  indices  (and  the 
shift  amounts)  are  dictated  by  the  relative  distances 
between  scattering  centers  of  the  largest  target  (the 
SCUD  launcher  with  the  missile  up,  establishes  a  308 
range  by  198  cross-range  table  and  the  154  range,  99 
cross  range  shift  for  all  the  data).  In  the  recognition 
phase,  the  relative  positions  of  scattering  centers  are 
obtained  from  the  articulated  (or  occluded)  test  data. 
These relative distances are indices into a look-up table
which provides the associated object label(s) that
are used to accumulate evidence for target identification.
The process is repeated with different scattering
centers  as  reference  points,  providing  multiple  ‘looks’ 
at  the  database.  The  target  type  and  azimuth  angle 
pair  with  the  most  ‘votes’  is  chosen  as  the  answer. 

4.3  Algorithms  and  Complexity 

A  single  test  data  or  model  (object,  azimuth  angle) 
instance  i,  with  M  features  is  represented  by  relative 
range  (r)  and  cross  range  (c)  distance  distributions 
given  by: 

“(dc)  =  |J  (1) 

where $k = 1$ for the models and $\sum k \le M(M-1)/2$
(duplicates are removed), while for the test data $k \ge 1$
and $\sum k = M(M-1)/2$ (duplicates are allowed). In
the recognition phase an object, azimuth model instance
$i$ gets votes, $V_i$, as the intersection of the test
data distance distribution and the distance distribution
$a_i$ for model $i$:

$$V_i = \sum_{r} \sum_{c} a_{\text{test}}(r, c)\, a_i(r, c). \qquad (2)$$

Figure 9: Relative distance distribution for SCUD
launcher with the missile up.

The rule for object identification I is given by:

$$I = \max_i \{V_i\} \quad \text{for all models } (i = 1 \text{ to } 360N); \qquad (3)$$

where N is the number of objects. The recognition
engine implements equations (1) to (3) as a look-up
table and a decision rule.
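
As an illustration of the vote count, the following minimal sketch treats each distance distribution as a multiset of (range, cross-range) offsets and computes the votes as the sum over cells of the product of test and model counts. It assumes peaks are integer (range, cross-range) pairs listed in a fixed order; the function names are hypothetical and the sketch is not the original implementation.

```python
from collections import Counter
from itertools import combinations


def distance_distribution(centers, keep_duplicates):
    """Count the relative (range, cross-range) offsets over all point pairs.

    With keep_duplicates=False every occupied cell gets a count of 1,
    as described for the models; the test data keeps the full counts.
    """
    pairs = Counter((r2 - r1, c2 - c1)
                    for (r1, c1), (r2, c2) in combinations(centers, 2))
    if keep_duplicates:
        return pairs
    return Counter(dict.fromkeys(pairs, 1))


def votes(a_test, a_model):
    """Sum over cells of test count times model count (missing cells count 0)."""
    return sum(k * a_model[cell] for cell, k in a_test.items())


peaks = [(0, 0), (2, 1), (4, 2)]
a_model = distance_distribution(peaks, keep_duplicates=False)
a_test = distance_distribution(peaks, keep_duplicates=True)
print(votes(a_test, a_model))  # every test pair finds its model cell
```

When the test data matches the model exactly, every one of the M(M-1)/2 test pairs lands in an occupied model cell, so the vote count reaches its maximum.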

The  basic  steps  of  the  off-line  model  construction 
algorithm  are: 

1.  For  each  of  N  Objects  do  2: 

2.  For  each  of  360  Azimuth  angles  do  3,4: 

3.  Obtain  the  (row,  column)  locations  of  the 
strongest  M  peaks  from  image(Object,  Azimuth), 

4.  For  each  of  M  Origins  do  5: 

5. For each Point from Origin + 1 to M do 6,7:

6.  Find  the  relative  Range  and  Cross-range  positions 
of  the  Point  from  the  Origin, 

7.  Add  a  new  table  entry  at  (Range,  Cross-range) 
with  Object,  Azimuth  label  (add  to  a  list  if  table 
already  occupied). 
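
The construction steps above can be sketched as follows. This is a minimal illustration, not the original code: it assumes the peaks for each (object, azimuth) pair are already extracted and sorted strongest first, it uses a dictionary keyed by signed offsets instead of a shifted fixed-size array, and it removes duplicate offsets within a model as stated for equation (1). The name `build_table` is hypothetical.

```python
from collections import defaultdict


def build_table(models, num_peaks):
    """Build the look-up table from non-articulated model data.

    models: dict mapping (object, azimuth) -> list of (range, cross_range)
            peak locations, strongest first.
    Returns a dict mapping (d_range, d_cross_range) -> list of labels.
    """
    table = defaultdict(list)
    for label, peaks in models.items():
        peaks = peaks[:num_peaks]          # strongest M peaks only
        seen = set()                       # duplicates are removed per model
        for origin in range(len(peaks)):
            for point in range(origin + 1, len(peaks)):
                key = (peaks[point][0] - peaks[origin][0],
                       peaks[point][1] - peaks[origin][1])
                if key not in seen:
                    seen.add(key)
                    table[key].append(label)
    return table


models = {
    ("T72", 30): [(0, 0), (2, 1), (4, 2)],
    ("T80", 150): [(5, 5), (9, 7)],
}
table = build_table(models, num_peaks=50)
print(dict(table))
```

A dictionary avoids the need for the 154 range, 99 cross-range shift, since negative offsets are valid keys; a fixed array with shifted indices would behave the same way for in-bounds offsets.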

The model construction algorithm complexity is
360 N M (M-1)/2, where N is the number of target
types  and  M  is  the  number  of  peaks  used.  The  on-line 
recognition  algorithm  steps  are: 

1.  Obtain  the  (row,  column)  locations  of  the 
strongest  M  peaks  from  test  image. 

2.  For  each  of  M  Origins  do  3: 

3. For each Point from Origin + 1 to M do 4,5,6:


4.  Find  the  relative  Range  and  Cross-range  positions 
of  the  Point  from  the  Origin, 

5.  Look  up  the  list  of  table  entries  at  (Range,  Cross- 
range), 

6.  Traverse  the  list;  reading  labels  and  incrementing 
Object,  Azimuth  label  accumulators. 

7.  At  completion,  select  the  Object,  Azimuth  label 
with  the  largest  accumulated  total. 

The recognition algorithm makes M (M - 1)/2
queries  to  the  lookup  table,  where  M  is  the  number 
of  peaks  used.  The  only  models  associated  with  a 
lookup  table  entry  are  the  “real”  model  and  any  other 
models  that  happen,  by  coincidence,  to  have  a  feature 
pair  with  the  same  relative  distances. 
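
The recognition steps can be sketched in the same spirit. The example below is a minimal illustration under stated assumptions: the look-up table is a plain dictionary from (range, cross-range) offsets to lists of (object, azimuth) labels, the test peaks are sorted strongest first, and the name `recognize` is hypothetical. Tie-breaking is not specified in the text; here the first label reaching the maximum wins.

```python
from collections import Counter


def recognize(table, test_peaks, num_peaks):
    """Vote for (object, azimuth) labels using relative peak positions."""
    peaks = test_peaks[:num_peaks]         # strongest M peaks only
    accumulators = Counter()
    for origin in range(len(peaks)):
        for point in range(origin + 1, len(peaks)):
            key = (peaks[point][0] - peaks[origin][0],
                   peaks[point][1] - peaks[origin][1])
            for label in table.get(key, ()):   # traverse the list of labels
                accumulators[label] += 1
    if not accumulators:
        return None, accumulators
    best = max(accumulators, key=accumulators.get)
    return best, accumulators


table = {
    (2, 1): [("T72", 30)],
    (4, 2): [("T72", 30), ("T80", 150)],
}
test_peaks = [(10, 10), (12, 11), (14, 12)]
best, totals = recognize(table, test_peaks, num_peaks=50)
print(best, dict(totals))
```

Each of the M(M-1)/2 pairs costs one dictionary query plus a traversal of the labels stored at that cell, so the running time depends on how many models share a cell by coincidence.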

5  Results 

The  XPATCH  radar  signature  prediction  code  [1] 
is  used  to  generate  target  chips  at  360  azimuth  angles 
(at  a  15°  depression  angle,  90°  squint  angle)  from  the 
CAD models of the various objects. For 5 objects,
the non-articulated set used for model construction is
1800 target chips. There are 7 sets of articulated test
data (SCUD Launcher with the missile up, and the
T72, M1a1, and T80 tanks with 60° and 90° turret
angles) for an additional 2520 target chips. The scattering
center locations are obtained at local maxima
in the signal amplitude (where the amplitude is greater
than that of the surrounding eight neighbors; if equal, the
point is rejected unless it is last in raster scan order).
Examples, at various azimuths, of the object geometry, SAR image and
(strongest 50) scattering center locations are shown for
both non-articulated and articulated cases of the T72
(Figure 10), M1a1 (Figure 11), T80 (Figure 12), and
the SCUD launcher (Figure 13). (Figures 10 - 13 are
not to scale and the image is displayed at 8 intensity
levels, the scattering centers at 256 levels.)
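
The peak extraction rule can be sketched as below. This is one possible reading, stated as an assumption: a pixel is a peak if it is strictly greater than each of its eight neighbors, and a tie with a neighbor is tolerated only when that neighbor comes earlier in row-major raster order. Border pixels are compared only with the neighbors that exist. The name `scattering_centers` is hypothetical.

```python
def scattering_centers(image, num_peaks):
    """Return (row, column) locations of the strongest local maxima.

    image: list of equal-length rows of amplitudes.
    """
    rows, cols = len(image), len(image[0])
    peaks = []
    for i in range(rows):
        for j in range(cols):
            value = image[i][j]
            is_peak = True
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < rows and 0 <= nj < cols):
                        continue  # neighbor falls outside the chip
                    other = image[ni][nj]
                    comes_later = (ni, nj) > (i, j)  # raster scan order
                    if other > value or (other == value and comes_later):
                        is_peak = False
            if is_peak:
                peaks.append((value, i, j))
    peaks.sort(key=lambda p: -p[0])  # strongest first, stable for ties
    return [(i, j) for _, i, j in peaks[:num_peaks]]


image = [
    [0, 1, 0, 0],
    [1, 5, 5, 0],
    [0, 1, 0, 2],
]
print(scattering_centers(image, num_peaks=50))
```

In the small example the two equal values of 5 form a plateau, and only the one that is last in raster order is kept.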

5.1  Articulated  Objects 

5.1.1  Identification  and  Pose  (4  Object  Table) 

A 4 object recognition table for the SCUD missile
launcher, T72, M1a1 and T80 tanks is constructed
from 1440 non-articulated target chips using
the  model  construction  algorithm  given  above.  The 
experimental  results  of  2520  trials  with  articulated 
objects  for  the  recognition  engine  using  50  scattering 
centers  and  this  4  object  table  are  shown  as  a  con¬ 
fusion  matrix  in  Table  3.  The  overall  performance  is 
a  93.14%  probability  of  correct  identification  (PCI). 
The  azimuth  accuracy  is  shown  in  Table  4,  where  ‘e’ 
is  an  exact  match  and  ‘c’  indicates  a  match  within 
±5°.  The  azimuth  results  are  reported  for  the  hull 
angle. In the case of the M1a1 tank, decreased azimuth
accuracy results when identifications are based
on  the  (relatively  large)  turret  rather  than  the  hull. 

5.1.2  Number  of  Scattering  Centers  Used 

The  number  of  scattering  center  locations  used  is  a  de¬ 
sign  parameter  that  can  be  tuned  to  optimize  perfor¬ 
mance  of  the  recognition  engine.  For  the  objects  and 

Figure 10: Examples of T72 tank geometry, SAR image and scattering centers for 30° and 60° hull azimuths.

Figure 11: Examples of M1a1 tank geometry, SAR image and scattering centers for 90° and 120° hull azimuths.


Figure  12:  Examples  of  T80  tank  geometry,  SAR  image  and  scattering  centers  for  150°  and  180°  hull  azimuths. 

Figure 13: Examples of SCUD Launcher geometry, SAR image and scattering centers.


Table 3: Confusion matrix for articulated identification
results (4 object table, 50 scattering centers).

Figure 14: Effect of the number of scattering centers
on articulated recognition rate.


Table 4: Confusion matrix for articulated pose accuracy
results (e = exact pose, c = pose within ±5°).

Non-articulated models

SCUD down | T72 | M1a1 | T80

0° turret

SCUD missile up

360e

360e

articulations  used  in  Tables  3  and  4,  the  plot  of  overall 
PCI  vs  number  of  scattering  centers  is  shown  in  Figure 
14 (each point is the result of 2520 trials). The maximum
performance is achieved at 50 scattering centers
(93.14%), but virtually the same performance could be
found at 42 scattering centers (93.10%). A more efficient
system with 35 scattering centers achieves similar
performance, 92% PCI, with slightly less than half the
storage and twice the speed of 50 scattering centers.
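
The storage and speed comparison can be checked from the pair count M(M-1)/2 given in Section 4.3, under the assumption that both table size and the number of look-ups scale with the number of scattering center pairs:

```latex
\frac{35 \cdot 34 / 2}{50 \cdot 49 / 2} = \frac{595}{1225} \approx 0.486,
\qquad
\frac{1225}{595} \approx 2.06 .
```

So a 35 scattering center system needs about 49% of the pairs of a 50 scattering center system, consistent with "slightly less than half the storage and twice the speed".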

5.1.3  Articulation  Invariance 

The  detailed  recognition  results  can  be  related  to  the 
articulation  invariance  of  articulated  objects.  The 
recognition  failures  for  the  T72  tank  with  the  turret  at 
90°  are  plotted  on  the  curve  of  percent  invariance  vs 
azimuth  in  Figure  15.  These  results  show  that  recog¬ 
nition  failures  generally  occur  for  azimuths  where  the 
percent  invariance  is  low.  Figure  16  shows  how  the 
PCI varies with the percent invariance. The points
at low invariance values are misleading, because they
are due to a few correct identifications for the M1a1
tank, where the invariance (measured with respect to
the hull) is low, yet a correct identification is made
from features on the large turret. The overall recognition
engine performance is almost perfect for invariance
values greater than 40 percent (i.e., down to 60
percent spurious data).

T72 tank with 90 degree turret (50 scattering centers)

Figure 16: Recognition rate and articulation invariance
(50 scatterers, average of 4 objects).

5.1.4  Identification  (5  Object  Table) 

Table 5 shows results with a 5 object recognition table
(50 scattering centers for each model), with the non-articulated
FRED tank (which looks similar to the
M1a1 tank, see Figure 17) added as a “confuser” in
tests against the same 2520 articulated cases. In only
four cases was a test object confused with the FRED
tank: three times a T72 tank with 60° turret was now
misidentified as a FRED tank, and once a T72 tank with
90° turret, which had been misidentified as an M1a1
tank, was now misidentified as a FRED tank. The
overall PCI for the 5 object table (with 50 scattering
center models) was 93.02% versus 93.14% for the 4
object table.

Table 5: Confusion matrix for articulated identification
results (5 object table, 50 scattering centers).

Figure 17: M1a1 (left) and FRED (right) tanks.

5.2  Occluded  Objects 

The occluded test data is simulated by starting with
a given number of the strongest scattering centers and
then removing the appropriate number of scattering
centers encountered in order, starting in one of four
perpendicular directions $d_i$ (where $d_1$ and $d_2$ are the
cross range directions, along and opposite the flight
path respectively, and $d_3$ and $d_4$ are the up range
and down range directions). Then the same number
of scattering centers (with random magnitudes) are
added back at random locations within the original
bounding box of the chip. Each data set is 5760 test
cases (4 objects × 4 directions × 360 azimuths). For
the non-articulated occluded tests (the objects are the
T72, M1a1, and T80 tanks with turret at 0° and the
SCUD launcher with the missile down) there are 51
data sets (for 10, 30 and 50 scattering centers with
10 to 90% occlusion in 10% steps and the same for 20
and 40 scatterers plus 55, 65, and 75% occlusion) for a
total of 293,760 test cases. Actually, only 50 data sets
with a total of 288,000 test cases are used, because the
data set of 10 scattering centers with 90% occlusion
has fewer than two valid scattering centers for each case.
For the articulated occluded tests the same tanks are
used with a 90° turret and the missile is erect, but
there are only 9 data sets (for 20 scattering centers
with 10 to 90% occlusion) for a total of 51,840 test
cases.
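
The occlusion simulation described above can be sketched as follows. This is a minimal illustration under several assumptions: only the locations of scattering centers are modeled (the random magnitudes are omitted), the occluding direction is expressed by a hypothetical `axis` index (0 for range, 1 for cross-range) and a `from_low` flag, and the spurious points are drawn uniformly from an inclusive bounding box. The name `occlude` is hypothetical.

```python
import random


def occlude(centers, percent, axis, from_low, bbox, rng):
    """Remove the first `percent` of centers met along one direction and
    add back the same number of spurious centers at random locations.

    centers: list of (range, cross_range) locations.
    bbox: ((r_min, r_max), (c_min, c_max)), inclusive bounds of the chip.
    Returns (valid_centers, spurious_centers).
    """
    n_remove = len(centers) * percent // 100
    ordered = sorted(centers, key=lambda p: p[axis], reverse=not from_low)
    valid = ordered[n_remove:]             # centers not reached by the occluder
    (r_min, r_max), (c_min, c_max) = bbox
    spurious = [(rng.randint(r_min, r_max), rng.randint(c_min, c_max))
                for _ in range(n_remove)]
    return valid, spurious


centers = [(r, 0) for r in range(10)]
valid, spurious = occlude(centers, percent=90, axis=0, from_low=True,
                          bbox=((0, 9), (0, 9)), rng=random.Random(0))
print(len(valid), len(spurious))
```

With 10 scattering centers and 90% occlusion only one valid center remains, so no pair of valid centers exists; this is why that data set is excluded from the 51 planned data sets.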

Figure 18: Recognition rate and occlusion percent.

5.2.1  Non-articulated  Occluded  Objects 

The  performance  of  the  recognition  engine  with  non- 
articulated  occluded  objects  is  shown  in  Figure  18  in 
terms  of  the  probability  of  correct  identification  (PCI) 
as  a  function  of  percent  occlusion  with  the  ‘number  of 
scattering  centers  used’  as  a  parameter.  The  results 
of  288,000  test  cases  are  shown,  where  each  point  for 
a  specific  percent  occlusion  and  number  of  scattering 
centers  is  the  average  PCI  for  all  4  occlusion  directions, 
the  4  objects  and  the  360  azimuths  (5760  test  cases). 
The  overall  recognition  engine  performance  is  almost 
perfect  for  up  to  60%  occlusion.  (This  corresponds  to 
results  shown  in  Figure  16  for  articulation  invariance 
of  40%  and  above.)  By  80  to  90%  occlusion,  the  re¬ 
sults  are  not  much  better  than  the  0.25  PCI  one  would 
expect  by  chance  from  the  4  possible  objects.  These 
performance  results  are  replotted  as  Figure  19  to  il¬ 
lustrate  the  effect  of  the  number  of  scattering  centers 
used  on  the  recognition  rate  for  the  highly  occluded 
cases.  This  indicates  that  optimal  performance  is  in 
the  region  of  20  to  40  scattering  centers. 

5.2.2  Articulated  Occluded  Objects 

Figure  20  shows  the  average  and  individual  test  ob¬ 
ject  performance  of  the  recognition  engine  (using  20 
scattering  centers)  as  a  function  of  percent  occlusion 
with  4  different  articulated  objects.  The  results  of 
51,840  test  cases  are  shown.  The  overall  performance 
for  these  articulated  objects  with  30%  occlusion  is  a 
0.698  PCI,  almost  the  0.70  system  level  goal  [4]  of  the 
Moving  and  Stationary  Target  Acquisition  and  Recog¬ 
nition  (MSTAR)  program.  The  results  are  consistent 
with  the  average  unoccluded  articulated  results  for  20 
scatterers,  shown  previously  in  Figure  14,  which  would 
be  a  0.899  PCI  at  a  “0%”  occlusion  in  the  occluded 
articulated  results  shown  here  in  Figure  20.  Figure  21 
compares the performance results of the articulated
and occluded articulated objects for cases with the
same number of valid scatterers (i.e. ‘scatterers used’
in the unoccluded cases or ‘unoccluded scatterers’ in
the occluded cases). In the occluded data the valid
points are ‘clustered’ in a neighborhood which gets
smaller as the occlusion increases (and the number
of valid scatterers decreases). These relatively worse
results for the naturally ‘clustered’ occluded articulated
data, compared with the more widely distributed
unoccluded articulated data (for the same number of
valid scattering centers), illustrate the importance of
the relatively rare long distances.

Figure 19: Effect of number of scatterers on occluded
recognition rate.

20 scattering centers, 4 articulated objects (SCUD, T72, M1, T80)

Figure 20: Articulated object recognition rate and occlusion
percent.

6  Conclusions 

The  XPATCH  generated,  six  inch  resolution  SAR 
imagery  has  great  azimuthal  variation  that  can  be  suc¬ 
cessfully  captured  by  using  360  azimuth  models  for  a 
given  depression  angle.  Useful  articulation  invariant 
features  are  found  in  SAR  images  of  military  vehicles. 
The feasibility of a new concept for a SAR recognition
engine to identify articulated and occluded objects
based on non-articulated recognition models is
demonstrated. The performance of the recognition
engine can be predicted by the percent articulation
invariance (or percent unoccluded) when comparing
the scattering center locations of the articulated (or
occluded) test images with the non-articulated model
scattering center locations. Limited experiments show
that scaling to model more objects provides similar
results, although performance will degrade depending
on the number of coincidental similarities found in the
radar signatures of the objects. Our results indicate
the importance of the relatively rare long distances
and suggest an explanation why this approach, which
can use long distances (if available), could have an advantage
over others [7] that are restricted to a “neighborhood”.

Articulated and occluded articulated performance; axis label: Probability of Correct Identification

Figure 21: Articulated object and occluded articulated
object performance results.

Use  of  real  SAR  images  of  actual  vehicles  (vs 
XPATCH  simulations  from  CAD  models)  would 
change  the  performance  and  detailed  implementation 
of  the  design,  but  not  the  basic  conceptual  approach. 
Our  approach  of  articulation  invariance  simply  treats 
the  articulated  region  as  a  “don’t  care”,  which  ap¬ 
plies  to  both  real  and  simulated  data.  If  real  SAR 
images  are  more  (or  less)  persistent  in  azimuth  than 
XPATCH,  then  the  recognition  engine  would  need 
fewer  (or  more)  azimuth  models.  Real  SAR  images 
and target chips are likely to have more noise than
the ideal models and test data produced by XPATCH;
however, larger sets of noisy data can be used to produce
useful recognition models. The noisy test data is
manifest as some percentage of spurious data, which is
similar to what was used to generate the occluded test
data, and the actual recognition results should suffer
as indicated in Figures 18 and 20 on the Probability
of Correct Identification vs percent occlusion curves
(with the corresponding source of the invalid scattering
centers being noise rather than ‘occlusion’).

Abstract

This  paper  presents  a  general  approach  to  image 
segmentation  and  object  recognition  that  can  adapt 
the  image  segmentation  algorithm  parameters  to  the 
changing  environmental  conditions.  Segmentation 
parameters  are  learned  using  a  reinforcement  learn¬ 
ing  (RL)  algorithm  that  is  based  on  a  team  of  learn¬ 
ing  automata  and  operates  separately  in  a  global 
or  local  manner  on  an  image.  The  edge-border 
coincidence  is  used  as  a  short  term  reinforcement 
to  reduce  the  computational  expense  due  to  model 
matching  during  the  early  stage  of  object  recogni¬ 
tion.  However,  since  this  measure  is  not  reliable 
for  object  recognition,  it  is  used  later  in  conjunction 
with  model  matching  in  a  closed-loop  object  recog¬ 
nition  system  that  uses  the  results  of  model  match¬ 
ing  as  a  reinforcement  signal  in  a  “biased”  learn¬ 
ing  system.  The  control  switches  between  learning 
integrated  global  and  local  segmentation  based  on 
the  quality  of  segmentation  and  model  matching. 
Results  are  presented  for  both  indoor  and  outdoor 
color  images  where  the  performance  improvement 
is  shown  for  both  image  segmentation  and  object 
recognition  with  experience. 

1  Introduction 

A  model  based  object  recognition  system  has  three 
key  components:  image  segmentation,  feature  ex¬ 
traction,  and  model  matching.  The  goal  of  image 
segmentation  is  to  extract  meaningful  objects  from 
an  input  image.  Image  segmentation  is  an  impor¬ 
tant  and  one  of  the  most  difficult  low-level  computer 
vision  tasks  [6] .  All  subsequent  tasks  including  fea¬ 
ture  extraction,  model  matching,  rely  heavily  on  the 
quality  of  the  image  segmentation  process. 

*This work is supported by DARPA/AFOSR grant
F49620-95-1-0424. The contents and information do not
necessarily reflect the position or the policy of the U.S.
Government.

The inability to adapt the image segmentation process
to real-world changes is one of the fundamental
weaknesses of typical model-based object recognition
systems. Despite the large number of image
segmentation  algorithms  available  [10],  no  general 
methods  have  been  found  to  process  the  wide  diver¬ 
sity  of  images  encountered  in  real  world  applications. 
Usually,  an  object  recognition  system  is  open-loop. 
Segmentation  and  feature  extraction  modules  use 
default  algorithm  parameters,  and  generally  work  as 
pre-processing  steps  to  the  model  matching  compo¬ 
nent.  The  fixed  sets  of  algorithm  parameters  used 
in  various  image  segmentation  and  feature  extrac¬ 
tion  algorithms  generally  degrade  the  system  per¬ 
formance  and  lack  adaptability  in  real-world  appli¬ 
cations.  These  default  sets  of  algorithm  parameters 
are  usually  obtained  by  the  system  designer  by  fol¬ 
lowing  a  trial  and  error  method.  Parameters  ob¬ 
tained  in  this  way  are  not  robust,  since  when  the 
conditions  for  which  they  are  designed  are  changed 
slightly,  these  algorithms  generally  fail  without  any 
graceful  degradation  in  performance. 

The  usefulness  of  a  set  of  algorithm  parameters  in 
a  system  can  only  be  determined  by  the  system’s 
output,  i.e.,  recognition  performance.  To  recognize 
different  objects  or  instances  of  the  same  object  in 
an  image,  we  may  need  different  sets  of  parameters 
locally  due  to  the  changes  in  local  image  properties, 
such  as  brightness,  contrast,  etc.  Also  the  changing 
environmental  conditions  (such  as  the  time  of  the 
day,  weather  conditions,  etc.),  affect  the  appearance 
of  an  image  which  requires  the  capability  to  adapt 
the  representation  parameters  for  multi-scenario  ob¬ 
ject  recognition.  To  achieve  robust  performance  in 
real-world  applications,  a  need  exists  to  apply  learn¬ 
ing  techniques  which  can  efficiently  search  image  seg¬ 
mentation  and  feature  extraction  algorithm  param¬ 
eter  spaces  and  find  parameter  values  which  yield 
optimal  results  for  the  given  recognition  task.  In 
this  paper,  our  goal  is  to  develop  a  general  approach 
to  a  learning  integrated  model-based  object  recog¬ 
nition  system,  which  has  the  ability  to  continuously 
adapt  to  normal  environmental  variations. 

In the remainder of Section 1, we present an
overview of the approach, related work and the contributions
of the paper. Section 2 gives the details of
the approach and discusses algorithms used in this
research. Section 3 provides the experimental results
for segmentation and recognition on both indoor and
outdoor color images. Finally, Section 4 presents the
conclusions and the future work.

Figure 1: Reinforcement learning integrated image
segmentation and object recognition system.

1.1  Overview  of  the  approach 

In  this  paper,  we  present  a  general  approach  to  re¬ 
inforcement  learning  integrated  image  segmentation 
and  object  recognition.  A  reinforcement  learning 
system  is  integrated  into  the  model-based  object 
recognition  system  to  close  the  loop  between  model 
matching  and  image  segmentation.  The  basic  as¬ 
sumption  is  that  we  know  the  models  of  the  objects 
that  are  to  be  recognized,  but  we  do  not  know  the 
number  of  objects  and  their  locations  in  the  image. 
The goal of the system is to maximize the matching
confidence by finding a set of image segmentation
algorithm parameters for the given recognition task.
(We do not discuss the problem of feature extraction
parameters in this paper; it is described
in a separate paper by Peng and Bhanu [1].) Thus,
we address the problem of adaptive segmentation as
finding  a  set  of  parameters  for  the  given  model  and 
given  input  image.  It  reflects  the  fact  that  there 
may  not  exist  a  single  set  of  “optimal”  parameters 
which  can  be  used  for  recognizing  different  objects 
in  a  given  image.  Figure  1  provides  an  overview  of 
the  system.  Basically,  the  system  consists  of  image 
segmentation,  feature  extraction,  model  matching, 
and reinforcement learning modules. The image segmentation
component extracts meaningful objects
from input images, the feature extraction step performs
polygonal  approximation  of  connected  components, 
and  the  model  matching  step  tells  us  which  regions 
in  the  segmented  image  contain  the  recognized  ob¬ 
ject.  The  model  matching  module  indirectly  eval¬ 
uates  the  performance  of  the  image  segmentation 
and  feature  extraction  processes  by  generating  a  real 
valued  matching  confidence  indicating  the  degree  of 
success.  This  real  valued  matching  confidence  is 


then  used  to  drive  learning  for  image  segmentation 
parameters  in  a  reinforcement  learning  framework. 

Given  the  computational  expense  for  performing 
model  matching,  our  approach  uses  edge-border  co¬ 
incidence  [5]  as  a  segmentation  evaluation  measure 
to  find  an  initial  point  from  which  to  begin  the  search 
through  weight  space.  However,  since  this  measure 
is  not  reliable  as  matching  confidence,  we  use  it  in 
conjunction  with  model  matching  in  a  closed-loop 
system  to  adapt  segmentation  parameters  to  current 
input  image  conditions.  Subsequent  feature  extrac¬ 
tion  and  model  matching  are  carried  out  for  each 
connected  component  which  passes  through  the  size 
filter  based  on  the  expected  size  of  objects  of  inter¬ 
est  in  the  image.  The  highest  matching  confidence 
is  taken  as  the  reinforcement  signal.  Learning  takes 
place  as  a  result  of  interactions  between  segmenta¬ 
tion  and  model  matching. 

Significant  differences  in  characteristics  exist  be¬ 
tween  an  image  and  its  subimages,  so  operating  con¬ 
ditions  are  tuned  to  these  differences  to  achieve  opti¬ 
mal  performance  of  segmentation  and  model  match¬ 
ing.  For  example,  to  recognize  two  objects  in  an  im¬ 
age  or  a  single  object  at  different  locations,  it  is  often 
difficult,  if  not  impossible,  to  meet  all  requirements 
with  one  process.  It  is  essential  to  localize  compu¬ 
tation  to  meet  each  individual  requirement.  Thus, 
we  adopt  a  control  that  switches  between  global  and 
local  segmentation  phases  based  on  the  quality  of 
image segmentation and model matching.

The  reinforcement  learning  integrated  image  seg¬ 
mentation  and  object  recognition  system  is  designed 
to  be  fundamental  in  nature  and  is  not  dependent 
on  any  specific  image  segmentation  algorithms  or 
type  of  input  images.  Reinforcement  learning  re¬ 
quires  only  the  goodness  of  the  performance  rather 
than  the  details  of  algorithms  that  produce  the  re¬ 
sults.  To  represent  segmentation  parameters  suit¬ 
ably  in  a  reinforcement  learning  framework,  the  sys¬ 
tem  only  needs  to  know  the  segmentation  parame¬ 
ters  and  their  ranges.  In  our  approach,  a  binary  en¬ 
coding  scheme  is  used  to  represent  the  segmentation 
parameters.  While  the  same  task  could  be  learned 
in  the  original  parameter  space,  for  many  types  of 
problems,  including  image  segmentation,  the  binary 
representation  can  be  expected  to  learn  much  faster 
[2].  In  this  sense,  the  system  is  independent  of  a 
particular  segmentation  algorithm  used. 
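
To make the idea of a binary parameter representation concrete, the following minimal sketch encodes one parameter, given only its range, as a fixed number of bits and decodes it back. The uniform quantization, the plain (non-Gray) binary code, and the names `encode` and `decode` are assumptions made for illustration; the text does not specify the encoding details.

```python
def encode(value, low, high, n_bits):
    """Quantize a parameter in [low, high] to n_bits binary digits."""
    levels = 2 ** n_bits - 1
    level = round((value - low) / (high - low) * levels)
    return [int(ch) for ch in format(level, f"0{n_bits}b")]


def decode(bits, low, high):
    """Map a list of binary digits back to a value in [low, high]."""
    level = int("".join(str(b) for b in bits), 2)
    return low + (high - low) * level / (2 ** len(bits) - 1)


bits = encode(100, low=0, high=255, n_bits=8)
print(bits, decode(bits, low=0, high=255))
```

With such a representation, a learner only has to choose each bit independently, and the segmentation algorithm receives the decoded value without the learner needing any knowledge of how that algorithm works.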

1.2  Related  work  and  our  contributions 

There  is  no  published  work  on  reinforcement  learn¬ 
ing  integrated  image  segmentation  and  object  recog¬ 
nition  using  multiple  feedbacks.  Bhanu  and  Lee 
[9]  presented  an  image  segmentation  system  which 
incorporates  a  genetic  algorithm  to  adapt  the  seg¬ 
mentation  process  to  changes  in  image  characteris¬ 
tics  caused  by  variable  environmental  conditions.  In 
their approach, multiple segmentation quality measures
are used as feedback. Some of these measures
require  ground-truth  information  which  may  not  be 
always  available.  Peng  and  Bhanu  [2]  presented  an 
approach  in  which  a  reinforcement  learning  system 
is  used  to  close  the  loop  between  segmentation  and 
recognition,  and  to  induce  a  mapping  from  input 
images  to  corresponding  segmentation  parameters. 
Their approach is based on global image segmentation,
which is not the best way to detect objects in
an  image;  we  need  the  capability  of  performing  seg¬ 
mentation  based  on  local  image  properties  (local  seg¬ 
mentation).  Another  disadvantage  of  their  method 
is  its  time  complexity  which  makes  it  problematic 
for  practical  application  of  computer  vision. 

For  object  recognition  applications,  the  efficiency  of 
the learning techniques is very important. How to
add bias, a priori or domain knowledge in a reinforcement
learning based system is an important topic of
research in reinforcement learning [3] [7] [8]. For the
RATLE system, Maclin and Shavlik [3] accept “advice”
expressed in a simple programming language.
This advice is compiled into a “knowledge-based”
connectionist Q-learning network. They show that
advice-giving  can  speed  up  Q-learning  when  the  ad¬ 
vice  is  helpful  (though  it  need  not  be  perfectly  cor¬ 
rect).  When  the  advice  is  harmful,  back  propagation 
training  quickly  overrides  it.  Dorigo  and  Colom- 
betti  [7]  show  that  by  using  a  learning  technique 
called  learning  classifier  system  (LCS),  an  external 
trainer  working  within  a  RL  framework  can  help 
a  robot  to  achieve  a  goal.  Thrun  and  Schwartz 
[8]  have  discussed  methods  for  incorporating  back¬ 
ground  knowledge  into  a  reinforcement  learning  sys¬ 
tem  for  robot  learning. 

In our approach, the edge-border coincidence is used to locate a
good initial point from which to begin the search through weight
space for high matching confidence values. Although as a
segmentation evaluation measure the edge-border coincidence is
not as reliable as the matching confidence, lower edge-border
coincidence values always result in poor model matching.
Likewise, higher edge-border coincidence values suggest with high
probability that the current set of segmentation parameters is in
a close neighborhood of the optimal one. It is an inexpensive way
to arrive at an initial approximation to a set of segmentation
parameters that gives rise to the optimal recognition
performance. The control switches between global and local
segmentation processes to optimize recognition performance. To
further speed up the learning process, the reinforcement learning
is biased when the model matching confidence or the edge-border
coincidence is used as the reinforcement signal (note that the
reinforcement learning is unbiased initially when the edge-border
coincidence is used as the reinforcement signal). We achieve
better computational efficiency of the learning system and
improved recognition rates compared to the system developed by
Peng and Bhanu [2].

The  original  contributions  of  the  reinforcement 
learning  integrated  image  segmentation  and  object 
recognition  system  presented  in  this  paper  are: 

•  To achieve robustness for an image recognition system
operating in the real world, model matching confidence is used
as feedback to influence the image segmentation process, and
thus provide an adaptive capability.

•  An RL system based on a team of learning automata is applied
to represent and update both global and local image
segmentation parameters. The learning system optimizes
segmentation performance on each individual image and
accumulates segmentation experience over time to reduce the
effort needed to optimize future unseen images.

•  Edge-border  coincidence,  as  a  segmentation 
evaluation  measure,  reduces  computational 
costs  by  avoiding  expensive  model  matching,  es¬ 
pecially  during  earlier  stages  of  learning. 

•  Learning local segmentation parameters on subimages, which
may potentially contain objects, improves the performance of
the object recognition system.

•  Explicit  bias  is  used  in  the  RL  based  system 
to  speed  up  the  learning  process  for  adaptive 
image  segmentation. 

2  Technical  Approach 

The goal of our system is to maximize the model matching
confidence by finding a set of image segmentation algorithm
parameters for a given recognition task. To reduce the
computational expense of model matching, the edge-border
coincidence is first used as an evaluation function to find a set
of parameters from which to begin the learning. The segmentation
process has two distinct phases: global and local. While global
segmentation is performed for the entire image, local
segmentation is carried out only for selected subimages. For a
set of input images, the system takes inputs sequentially. This
is similar to the human visual learning process, in which visual
stimuli are presented sequentially over time. For the first input
image, since the system has no accumulated experience, we
initialize the system using random weight values in the unbiased
stochastic RL algorithm. For each input image thereafter, the
learning process starts from the set of segmentation parameters
learned from all the previous input images. The following are the
main steps of our learning algorithm:

Initial Approximation. The edge-border coincidence is used as a
short term reinforcement during earlier stages of learning to
drive weight changes without going through the expensive model
matching process. Once the edge-border coincidence has exceeded a
given threshold, the weight changes will be driven by the
matching confidence, which requires more expensive computation of
feature extraction and model matching.

Learning Global Segmentation. A network of biased Bernoulli units
generates a set of segmentation parameters from which
segmentation is performed on the entire image. The evaluation of
the segmentation process is provided by the model matching
confidence, which is then used to drive changes to the weights
according to the reinforcement learning algorithm. We assume that
we have a priori knowledge of the size of objects of interest in
the images. For those connected components which pass through the
size filter based on the expected size of objects of interest in
the image, we perform feature extraction and model matching. The
highest matching confidence is taken as the reinforcement to the
learning system. If the highest matching confidence level is
above a given switching threshold, we focus image segmentation
and model matching on the connected component and switch to the
local search process.

Learning Local Segmentation. Once a connected component has been
extracted from the input image, the local search begins to find
the best fit parameters for the subimage. It starts from the
current estimate of weights that resulted from global learning.
As in global learning, the matching confidence is used to update
the weight estimates until the matching confidence reaches the
accepting threshold (0.8 in our experiments) or the number of
iterations reaches MaxLocal (set at 20 in our experiments). If
after MaxLocal loops the matching confidence is still under the
accepting threshold, we switch back to the global learning
process and continue the learning from the point at which we
switched to the local search process. If the matching confidence
reaches the accepting threshold, the learning process for the
current input image is terminated.

2.1  Phoenix  image  segmentation  algorithm 

Since we are working with color imagery in our experiments, we
have selected the Phoenix segmentation algorithm [16] [18]
developed at Carnegie-Mellon University and SRI International.
The Phoenix segmentation algorithm has been widely used and
tested. It works by recursively splitting regions using
histograms of color features. Phoenix contains seventeen
different control parameters, fourteen of which are adjustable.
The four most critical ones that affect the overall results of
the segmentation process are selected for adaptation: Hsmooth,
Maxmin, Splitmin, and Height. Hsmooth is the width of the
histogram smoothing window. Maxmin is the lowest acceptable
peak-to-valley height ratio. Splitmin represents the minimum area
for a region to be automatically considered for splitting. Height
is the minimum acceptable peak height as a percentage of the
second highest peak. Each parameter has 32 possible values
(Table 1). The resulting search space is 2^20 sample points. Each
of the Phoenix parameters is represented using a 5-bit binary
code, with each bit represented by one Bernoulli unit. To
represent 4 parameters, we need a total of 20 Bernoulli units.
More details about Phoenix are given in the report by Laws [16].

Table 1: Ranges for selected Phoenix parameters.

Parameter                  Sampling formula             Range
Hsmooth:  hs ∈ [0 : 31]    hsmooth = 1 + 2 × hs         1 - 63
Maxmin:   mm ∈ [0 : 31]    ep = ln(100) + 0.05 × mm     100 - 471
                           maxmin = exp(ep) + 0.5
Splitmin: sm ∈ [0 : 31]    splitmin = 9 + 2 × sm        9 - 71
Height:   h ∈ [0 : 31]     height = 1 + 2 × h           1 - 63
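
The sampling formulas of Table 1 can be illustrated with a short Python sketch. It is not the authors' code: the function names, the most-significant-bit-first ordering, the order of the four 5-bit groups, and the truncation of maxmin to an integer (suggested by the 100 - 471 range) are all assumptions made for illustration.

```python
import math

def bits_to_index(bits):
    """Read a bit sequence (assumed most significant bit first) as an integer."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index

def decode_phoenix_parameters(bits20):
    """Map 20 Bernoulli outputs to the four Phoenix parameters of Table 1.

    Assumed group order: Hsmooth, Maxmin, Splitmin, Height (5 bits each).
    """
    if len(bits20) != 20:
        raise ValueError("expected 20 bits (4 parameters x 5 bits)")
    hs, mm, sm, h = (bits_to_index(bits20[k:k + 5]) for k in range(0, 20, 5))
    return {
        "hsmooth": 1 + 2 * hs,
        # exp(ep) + 0.5 truncated to an integer (assumption), i.e. rounding
        "maxmin": int(math.exp(math.log(100) + 0.05 * mm) + 0.5),
        "splitmin": 9 + 2 * sm,
        "height": 1 + 2 * h,
    }
```

With all bits at 0 the sketch yields the lower end of every range in Table 1, and with all bits at 1 the upper end.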

2.2  Segmentation  evaluation 

Given that feature extraction and model matching are
computationally expensive processes, it is imperative that an
initial approximation be made such that overall computation can
be reduced. To achieve this objective, we introduce a secondary
feedback signal, segmentation evaluation, that evaluates the
image segmentation quality. A large number of segmentation
quality measures have been suggested. The segmentation evaluation
we selected is the edge-border coincidence [9] [17], which
measures the overlap of the region borders in the segmented image
relative to the edges found using an edge detector, and does not
depend on any ground-truth information. In this approach, we use
the Sobel edge detector to compute the necessary edge
information. Edge-border coincidence is defined as follows. Let E
be the set of pixels extracted by the edge operator and S be the
set of pixels found on the region boundaries obtained from the
segmentation algorithm:

$$\text{Edge-border coincidence} = \frac{n(E \cap S)}{n(E)},$$

where n(A) is the number of elements in set A.
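
As a small illustration of this measure, the following Python sketch computes the coincidence from two sets of pixel coordinates. The function name is hypothetical, and normalizing by the number of edge pixels n(E) is an assumption based on the phrase "relative to the edges found using an edge detector".

```python
def edge_border_coincidence(edge_pixels, border_pixels):
    """Fraction of edge pixels that also lie on segmentation region borders.

    Both arguments are iterables of (row, col) pixel coordinates.
    Dividing by n(E) is an assumption; see the lead-in text.
    """
    E = set(edge_pixels)
    S = set(border_pixels)
    if not E:
        return 0.0  # no edges found: define the measure as zero (assumption)
    return len(E & S) / len(E)
```

For example, four edge pixels of which two coincide with region borders give a value of 0.5.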

Figure 2 shows the Sobel edge image of an experimental indoor
color image and the boundaries of the segmented image using the
Phoenix segmentation algorithm. The edge-border coincidence for
the segmented image is 0.6825. Segmentation evaluation indicates
the quality of the segmentation process. Matching confidence, the
recognition system’s output, indicates the confidence of the
model matching process, and indirectly shows the segmentation
quality of the recognized object. It is possible that
segmentation evaluation is high and matching confidence level is
low, or segmentation evaluation is low and matching confidence is
high. Figure 3(a) shows that global segmentation evaluation is
not well correlated with matching confidence. However, local
segmentation evaluation, which measures the overlap between the
edges and region borders of a subimage, is strongly correlated
with the matching confidence, as shown in Figure 3(b).

Figure 2: Edge-border coincidence. (a) input image; (b) Sobel
edge magnitude image (threshold = 200); (c) boundaries of the
segmented image. Segmentation parameters are: Hsmooth=7,
Maxmin=128, Splitmin=47, Height=Q0.

Although the global segmentation evaluation does not correctly
predict the matching confidence, for our purpose it is sufficient
to drive initial estimates. If the edge-border coincidence is
under a threshold, which indicates that a good recognition result
is unlikely, the system repeats the initial estimation process
using the edge-border coincidence as the sole reinforcement
feedback signal until the edge-border coincidence is greater than
the threshold. From then on, the segmentation performance is
determined completely by the model matching.
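
The choice between the two feedback signals can be sketched as follows. This is only an illustration: `select_reinforcement` and its callable argument are hypothetical names, and the default threshold of 0.5 is the EB1 value reported for the experiments.

```python
def select_reinforcement(edge_border, match_confidence_fn, threshold=0.5):
    """Return (reinforcement, used_matching) for one learning step.

    edge_border: edge-border coincidence of the current segmentation.
    match_confidence_fn: zero-argument callable that runs the expensive
        feature extraction and model matching (hypothetical interface).
    """
    if edge_border <= threshold:
        # Cheap feedback only: model matching is skipped entirely.
        return edge_border, False
    # Segmentation looks promising: pay for model matching.
    return match_confidence_fn(), True
```

The point of the design is that the expensive callable is never invoked while the segmentation is still poor.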


2.3  Reinforcement  learning  for  image 
segmentation 

Reinforcement  learning  is  the  problem  faced  by  an 
agent  that  must  learn  behavior  through  trial-and- 
error  interactions  with  a  dynamic  environment.  It 
is  appropriately  thought  of  as  a  class  of  problems, 
rather  than  as  a  set  of  techniques  [4].  This  type 
of  learning  has  a  wide  variety  of  applications,  rang¬ 
ing  from  modeling  behavior  learning  in  experimental 
psychology  to  building  active  vision  systems.  The 
term  reinforcement  comes  from  studies  of  animal 
learning  in  experimental  psychology.  The  basic  idea 
is  that  if  an  action  is  followed  by  a  satisfactory  state 
of  affairs  or  an  improvement  in  the  state  of  affairs, 
then  the  tendency  to  produce  that  action  is  rein¬ 
forced.  Reinforcement  learning  is  similar  to  super¬ 
vised  learning  in  that  it  receives  a  feedback  to  adjust 
itself.  However,  the  feedback  is  evaluative  in  the  case 
of  reinforcement  learning.  In  general,  reinforcement 
learning  is  more  widely  applicable  than  supervised 
learning  and  it  provides  a  competitive  approach  to 
building autonomous learning systems that must operate in the
real world.

Figure 3: (a) Global edge-border coincidence vs. matching
confidence; (b) Local edge-border coincidence vs. matching
confidence for recognizing the cup in the image shown in
Figure 2(a).


There are several reasons why we apply reinforcement learning in
our computer vision system. First, reinforcement learning
requires knowing only the goodness of the system performance
rather than the details of the algorithms that produce the
results. In the object recognition system, model matching
confidence indirectly evaluates the performance of the image
segmentation and feature extraction processes. It is a natural
choice to select matching confidence as a reinforcement signal.
Second, convergence is guaranteed for several reinforcement
learning algorithms. Third, reinforcement learning is well suited
to the multi-level object recognition problems in image
understanding. It can systematically assign rewards to different
levels in a computer vision system.


Figure  4:  Basic  structure  of  a  Bernoulli  unit. 

The particular class of reinforcement learning algorithms
employed in our system is the connectionist REINFORCE algorithm
[11], where units in such a network are Bernoulli quasi-linear
units. Figure 4 shows the basic structure of a Bernoulli unit. A
team of five independent Bernoulli units represents a
segmentation parameter with 32 possible values. The output of
each unit is either 1 or 0, determined stochastically using the
Bernoulli distribution with parameter p = f(s), where f is the
logistic function. For such a unit, p represents the probability
of choosing 1 as its output value.

$$f(s_i) = \frac{1}{1 + e^{-s_i}}, \qquad s_i = \sum_j w_{ij} x_j,$$

where $w_{ij}$ is the weight of the jth input for unit i, and
$x_j$ is the jth input value for each unit. In the reinforcement
learning paradigm, the learning component uses the reinforcement
$r(t)$ to drive the weight changes according to the particular
reinforcement learning algorithm used by the network. The
specific algorithm we used has the following form: for each unit,
at the tth time step, after generating output $y_i(t)$ and
receiving reinforcement signal $r(t)$, increment each weight
$w_{ij}$ by

$$\Delta w_{ij}(t) = \alpha\,[r(t) - \bar{r}(t-1)]\,[y_i(t) - \bar{y}_i(t-1)]\,x_j - \delta\,w_{ij}(t)$$

where $\alpha$ is the learning rate, $\delta$ is the weight decay
rate, $x_j$ is the input to each Bernoulli unit, and $y_i$ is the
output of the ith Bernoulli unit. The term $r(t) - \bar{r}(t-1)$
is called the reinforcement factor, and $y_i(t) - \bar{y}_i(t-1)$
is the eligibility of the weight $w_{ij}$. $\bar{r}(t)$ is the
exponentially weighted average of prior reinforcement values,

$$\bar{r}(t) = \gamma\,\bar{r}(t-1) + (1 - \gamma)\,r(t), \quad \text{with } \bar{r}(0) = 0,$$

where $\gamma$ is the trace parameter. Similarly, $\bar{y}_i(t)$
is an average of past values of $y_i$ computed by the same
exponential weighting scheme used for $\bar{r}(t)$,

$$\bar{y}_i(t) = \gamma\,\bar{y}_i(t-1) + (1 - \gamma)\,y_i(t)$$

The algorithm has the convergence property [11] that it
statistically climbs the gradient of expected reinforcement in
weight space. The weight decay is used as a simple method to
force the sustained exploration of the weight space.
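
A minimal Python sketch of one update step for a single Bernoulli unit follows, assuming the update rule reads as the product of the reinforcement factor, the eligibility, and the input, minus the weight decay term. The function names are hypothetical, and the default values 0.02, 0.9, and 0.01 are the learning rate, trace parameter, and decay rate reported for the experiments.

```python
import math

def logistic(s):
    """Logistic function f(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def output_probability(weights, x):
    """Probability that the unit outputs 1 for input vector x."""
    return logistic(sum(w * xj for w, xj in zip(weights, x)))

def reinforce_update(weights, x, y, y_bar_prev, r, r_bar_prev,
                     alpha=0.02, gamma=0.9, delta=0.01):
    """One weight update for one unit; returns (weights, y_bar, r_bar).

    y, r: current output and reinforcement.
    y_bar_prev, r_bar_prev: exponentially weighted averages at t - 1.
    """
    factor = r - r_bar_prev          # reinforcement factor
    eligibility = y - y_bar_prev     # eligibility of the weights
    new_weights = [w + alpha * factor * eligibility * xj - delta * w
                   for w, xj in zip(weights, x)]
    y_bar = gamma * y_bar_prev + (1.0 - gamma) * y
    r_bar = gamma * r_bar_prev + (1.0 - gamma) * r
    return new_weights, y_bar, r_bar
```

A reinforcement above its running average, combined with an output above its running average, pushes the weights toward producing that output again; the decay term pulls every weight slightly toward zero.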

Note that a team of 20 Bernoulli units represents the four image
segmentation parameters selected for learning. The bits of a
parameter are generated independently of one another, which
allows the system to search the parameter space thoroughly.

2.4  Feature  extraction  and  model  matching 

Feature extraction consists of finding polygon approximation
tokens for each connected component obtained after image
segmentation. To speed up the learning process, we assume that we
have prior knowledge of the approximate size (area) of the
object, and only those connected components whose area (number of
pixels) is comparable with the area of the model object are
approximated by a polygon. In Figure 1, the region filter selects
those connected components whose areas are in the expected range.
For example, in our experiment on indoor images, the cup is the
target object. The expected area is from 200 to 450 pixels.
Figure 5 shows the boundaries of a segmented image, selected
regions whose areas are in the expected range, and the polygon
approximation of these regions.

Figure 5: (a) Boundaries of the segmented image shown in
Figure 2(a) (segmentation parameters are: Hsmooth=7, Maxmin=128,
Splitmin=47, Height— hA). (b) Selected regions whose areas are in
the expected range (200 - 450 pixels). (c) Polygon approximation
of these regions (parameters as specified in this section).

The polygon approximation is implemented by calling the polygon
approximation routine in Khoros [12]. The result is a vector
image that stores the linear approximation; it contains two
points for each estimated line. The polygon approximation has a
fixed set of parameters:

•  Minimal  segment  length  for  straight  line  -  5. 
When  the  estimated  straight  line  has  a  length 
less  than  this  threshold,  it  is  skipped  over. 

•  Elimination  percentage  -  0.1.  Percentage  of  line 
length  rejected  to  calculate  parameters  of  the 
straight  line. 

•  Approximation  error  -  0.6.  Threshold  Value  for 
the  approximation  error.  When  the  calculated 
error  is  greater  than  this  value,  the  line  is  bro¬ 
ken. 

Model  matching  employs  a  cluster-structure  match¬ 
ing  algorithm  [14]  which  is  based  on  forming  the  clus¬ 
ters  of  translational  and  rotational  transformations 
between  the  object  and  the  model.  The  algorithm 
takes  as  input  two  sets  of  tokens,  one  of  which  rep¬ 
resents  the  stored  model  and  the  other  represents 
the  input  region  to  be  recognized.  It  then  per¬ 
forms  topological  matching  between  the  two  token 
sets  and  computes  a  real  number  that  indicates  the 
confidence  level  of  the  matching  process.  Basically, 
the  technique  consists  of  three  steps:  clustering  of 
border  segment  transformations;  finding  continuous 
sequences  of  segments  in  appropriately  chosen  clus¬ 
ters;  and  clustering  of  sequence  average  transforma¬ 
tion  values.  More  details  about  this  algorithm  are 
given  in  [14]. 
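
The region filter used before polygon approximation can be sketched in a few lines of Python. The names are hypothetical, label 0 is assumed to mark background pixels, and treating both ends of the expected area range as inclusive is an assumption.

```python
from collections import Counter

def component_areas(label_image):
    """Count pixels per connected-component label (0 assumed background)."""
    counts = Counter(label for row in label_image for label in row)
    counts.pop(0, None)
    return dict(counts)

def region_filter(areas, min_area=200, max_area=450):
    """Return the labels whose area lies in the expected range (inclusive)."""
    return sorted(label for label, area in areas.items()
                  if min_area <= area <= max_area)
```

The defaults correspond to the expected cup size for the indoor images; for the traffic sign in the outdoor images the range would be 36 to 100 pixels.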

2.5  Biased  reinforcement  learning  for 
image  segmentation 

In the RL algorithm as described in section 2.3, each of the bits
of each of the parameters is independent. The output of each bit
depends on the value of p, which represents the probability that
a unit chooses 1 as its output. In the initialization phase, we
use the unbiased RL algorithm, in which the output of each bit of
a parameter is determined in the following way:

$$y_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$

Figure 6: Matching confidence history (plotted against the number
of loops) of three runs of the biased and unbiased RL algorithms
on the image shown in Figure 2(a). (a) biased; (b) unbiased.


It is “unbiased” in that the output of a bit is governed solely
by the Bernoulli probability law. The advantage is that rapid
changes in output values allow giant leaps in the search space,
which in turn enable the learning system to quickly discover
suspected high pay-off regions. However, once the system has
arrived at the vicinity of a local optimum, as will be the case
after the initial estimation, changes in the most significant bit
will drastically alter the parameter value, often jumping out of
the neighborhood of the local optimum. Ideally, once the learning
system discovers that it is within a possible high pay-off
region, it should attempt to capture the regularities of the
region. This then biases future search toward points within it.
The challenge, of course, is to have a learning algorithm that
allows the parameters controlling the search distribution to be
adjusted so that this distribution comes to capture this
knowledge. The algorithm described here shows some promise in
this regard. In order to force parameters to change slowly, after
the initialization phase, we apply a biased RL algorithm in which
the two most significant bits of a parameter are forced to change
in a slower fashion as:

$$y_i = \begin{cases} 1 & \text{if } p > 0.5 \\ 0 & \text{otherwise} \end{cases}$$


and the other bits use the same rule as described in the unbiased
RL algorithm. Figure 6 shows the experimental results of the two
schemes on the image shown in Figure 2(a). In this experiment, we
only apply the initialization followed by global learning,
without switching between global and local learning. The results
show that the biased RL algorithm achieves a speed-up by a factor
of 2 to 3.

procedure Initialization()
    generate a set of random weights
    repeat
        compute the segmentation parameters
        segment the image and compute edge-border coincidence
        r(i) = edge-border coincidence, update weights
    until edge-border coincidence > EB1

procedure Global_Segmentation()
    r(0) = 0.5, highest_matching_confidence = 0
    for i from 1 to MaxGlobal do
        for each connected component which passes the size filter do
            feature extraction and model matching
            if matching_confidence > Switch then
                Local_Segmentation()
            if matching_confidence > highest_matching_confidence then
                highest_matching_confidence = matching_confidence
        r(i) = highest_matching_confidence
        if recognized all the connected components then exit
        count = 0
        repeat
            compute the segmentation parameters using r(i)
            segment the image using the current set of parameters
            count++
            if count > MaxSeg then exit
        until edge-border coincidence > EB1

procedure Local_Segmentation()
    extract subimage from the input image
    compute standard deviations of parts of the subimage
    copy the weights from global to local process
    count = 0
    while count < MaxLocal do
        subimage segmentation, feature extraction, and model matching
        update weights using matching confidence as reinforcement
        if matching confidence > Accept then recognized and return
        count++

Figure 7: Algorithm description.
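
The two output rules of this section can be contrasted in a short Python sketch. The function name is hypothetical, the five probabilities are assumed to be ordered most significant bit first, and `n_forced=2` encodes the statement that the two most significant bits are thresholded in the biased scheme.

```python
import random

def parameter_bits(probabilities, biased, rng, n_forced=2):
    """Generate the 5 output bits of one segmentation parameter.

    probabilities: p for each Bernoulli unit, most significant bit first.
    biased: if True, the first n_forced bits are set deterministically
        (1 if p > 0.5 else 0); all other bits are sampled with
        probability p, as in the unbiased scheme.
    """
    bits = []
    for position, p in enumerate(probabilities):
        if biased and position < n_forced:
            bits.append(1 if p > 0.5 else 0)
        else:
            bits.append(1 if rng.random() < p else 0)
    return bits
```

In the biased scheme, a most significant bit can flip only after learning has moved its probability across 0.5, so the parameter value stays near the current estimate while the low-order bits continue to explore.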


2.6  Algorithm  description 

Figure 7 shows the implementation of our algorithm. The algorithm
works by switching between global and local segmentation.
Initially, if the system has no accumulated knowledge, the
edge-border coincidence is used as the evaluation function to
search for a set of image segmentation parameters using the
unbiased reinforcement learning algorithm. Otherwise, the input
image is segmented using the set of parameters learned from
previous images. EB1 and EB2 are two thresholds for edge-border
coincidence. During the initial unbiased reinforcement learning
phase, if the edge-border coincidence is greater than EB1 (= 0.5
in our experiments), then we can start the learning process with
a high expectation of generating good recognition results. During
the global segmentation phase, if the segmentation quality is
less than EB2 (= 0.4 in our experiments), the object is less
likely to be present in the segmented image, and choosing another
set of parameters using the biased RL algorithm with the current
reinforcement signal can speed up the process.

Figure 8: Row 1: input images; rows 2, 3: corresponding segmented
image and recognized object. For each input image, global
segmentation evaluation, local segmentation evaluation for the
selected object, and matching confidence are (0.67, 0.74, 0.87);
(0.87, 0.62, 0.93); (0.22, 0.82, 0.91); (0.68, 0.73, 0.92). The
learned Phoenix segmentation parameters Hsmooth, Maxmin,
Splitmin, and Height after local learning process are
(7 122 47 52); (7 128 47 52); (5 471 19 58); (11 192 59 48).


Figure  9:  Row  1:  input  images;  row  2,  3:  corre¬ 
sponding  segmented  image  and  recognized  object. 
For  each  input  image,  global  segmentation  evalua¬ 
tion,  local  segmentation  evaluation  for  the  selected 
object,  and  matching  confidence  are  (0.59,  0.51, 
0.82);  (0.79,  0.57,  0.85);  (0.85,  0.76,  0.88);  (0.82, 
0.53,  0.92).  The  learned  Phoenix  segmentation  pa¬ 
rameters  Hsmooth,  Maxmin,  Splitmin,  and  Height 
after  local  learning  process  are  (11  367  43  26);  (11 
259  23  46);  (11  259  29  56);  (9  276  31  46). 


In the global segmentation procedure, if the global segmentation
loops more than MaxGlobal times, we conclude that the object does
not appear in the image and terminate the learning process for
the given input image. For each connected component which passes
the region filter, if the matching confidence is greater than
Switch, then we can switch the control from global to local
segmentation. During local segmentation, if the matching
confidence reaches Accept, we conclude that the connected
component is the recognized model object. If the local
segmentation loops more than MaxLocal times, the control switches
back to global segmentation, since the object is not likely to be
extracted in the subimage, and we resume the global segmentation
process.

3  Experimental Results

The system is verified through a set of 12 indoor and a set of 12
outdoor color images. These images are acquired at different
times and different viewing distances with varying lighting
conditions. The size of indoor images is 120 by 160 pixels, and
the size of outdoor images is 120 by 120 pixels. Each image is
decomposed into 4 images for Phoenix segmentation: the red,
green, and blue components, and the Y component of the YIQ model
of color images. For the indoor images, the desired object is the
cup in the image, and in the outdoor images, the target object is
the traffic sign. The expected sizes of the cup and the traffic
sign are 200 to 450 pixels and 36 to 100 pixels, respectively.


Based  on  the  size  of  the  object  to  be  recognized  in 
the  image,  we  divide  the  Y  component  image  into 
48  subimages  for  the  indoor  images,  and  36  subim¬ 
ages  for  the  outdoor  images.  Each  subimage’s  size 
is  20  by  20  pixels.  The  standard  deviations  of  those 
subimages  serve  as  inputs  to  each  Bernoulli  unit, 
i.e.,  each  Bernoulli  unit  has  a  total  of  48  inputs  (and 
therefore,  48  weights)  for  the  indoor  image,  and  has 
a  total  of  36  inputs  (36  weights)  for  the  outdoor  im¬ 
age.  To  learn  the  four  selected  Phoenix  segmentation 
parameters,  we  need  20  Bernoulli  units.  So  there  is 
a  total  of  960  weights  for  indoor  images,  and  720 
weights  for  outdoor  images. 
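
The input and weight counts above follow directly from the image and subimage sizes, as this Python sketch shows. The function names are hypothetical, the image is assumed to be a list of pixel rows, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics

def subimage_std_devs(image, block=20):
    """Standard deviation of each non-overlapping block x block subimage.

    image: list of rows of Y-component pixel values.
    Returns one value per subimage, in row-major order.
    """
    rows, cols = len(image), len(image[0])
    features = []
    for top in range(0, rows - block + 1, block):
        for left in range(0, cols - block + 1, block):
            pixels = [image[r][c]
                      for r in range(top, top + block)
                      for c in range(left, left + block)]
            features.append(statistics.pstdev(pixels))  # population std (assumption)
    return features

def total_weights(n_inputs, n_units=20):
    """Each Bernoulli unit has one weight per input."""
    return n_units * n_inputs
```

A 120 by 160 indoor image gives 6 x 8 = 48 inputs and 20 x 48 = 960 weights; a 120 by 120 outdoor image gives 6 x 6 = 36 inputs and 720 weights.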

For the team of 20 Bernoulli units, the parameters $\alpha$,
$\gamma$, and $\delta$ are determined empirically, and they are
kept constant for all images. In our experiments,
$\alpha = 0.02$, $\gamma = 0.9$, and $\delta = 0.01$; EB1 = 0.5
and EB2 = 0.4; MaxGlobal, MaxLocal, and MaxSeg are all set to 20.
The thresholds for matching confidence are Switch = 0.6 and
Accept = 0.8. The threshold used for extracting edges with the
Sobel operator is set at 200.

3.1  Results  on  indoor  and  outdoor  images 
Figures 8 and 9 show the experimental results on the set of 12
indoor color images and the set of 12 outdoor color images. For
each indoor image, the globally segmented image using the set of
learned parameters and the extracted object which has been
finally recognized are presented. For each set of images, the 12
images are taken sequentially. Except for the first image, the
learning process for each image starts from the global
segmentation parameters learned from all the previous images. For
the first input image, the learning system is initialized using
the unbiased RL algorithm. Usually, it takes fewer than 45
iterations to find a set of segmentation algorithm parameters
which produces high edge-border coincidence. Figures 8 and 9 also
show the global edge-border coincidence, local edge-border
coincidence, model matching confidence, and the four learned
segmentation parameters after the local learning process for each
input image.

Figure 10: (a) CPU time for 5 different runs on 12 indoor images
and the average; (b) Number of loops for 5 different runs on 12
indoor images and the average; (c) CPU time for 5 different runs
on 12 outdoor images; (d) Number of loops for 5 different runs on
12 outdoor images.

Figure 11: Comparison of two approaches: scheme 1, the approach
presented in this paper; scheme 2, Peng and Bhanu’s approach [2].
(a) Comparison of the average CPU time of 5 different runs on 12
indoor images; (b) Comparison of the accumulated average CPU time
of 5 different runs on 12 indoor images.

Figure  10  shows  the  CPU  time  for  the  12  indoor  im¬ 
ages  and  12  outdoor  images  for  five  different  runs, 
and  the  number  of  loops  for  each  input  image,  which 
is  the  sum  of  all  the  loops  involved  in  the  global 
learning  and  local  learning  processes.  These  two 
curves  show  the  learning  capability  of  the  system, 
i.e., the system uses less and less CPU time with
experience to find a set of segmentation parameters
and correctly recognize the object. The number of
learning  loops  decreases  with  the  accumulation  of 
experience. 

3.2  Comparison  of  the  two  approaches 

In this section we compare the performance of our
system as shown in Figure 1 with the approach discussed
in the paper by Peng and Bhanu [2]. We show
the  effect  of  incorporating  segmentation  evaluation 
using  the  edge-border  coincidence  into  the  learning 
system  and  the  impact  of  global  and  local  segmen¬ 
tations  on  model  matching. 

The key differences between the two methods are the
introduction of the local segmentation process, the
biasing of the RL algorithm, and the use of edge-border
coincidence to evaluate segmentation performance
during the earlier stages of learning, which reduces
the computational expense stemming from model
matching. The segmentation process alternates
between the whole image and its subcomponents.
Local segmentation is highly desirable
when there are multiple targets or a single target
at multiple locations with different local characteristics.
It can dramatically improve the recognition
performance. The biasing of the RL algorithm reduces
computational time, as illustrated in Figure 6.

In  the  paper  by  Peng  and  Bhanu  [2],  the  matching 
confidence  is  the  only  feedback  that  drives  learning. 
Although  it  is  undoubtedly  the  most  reliable  mea¬ 
sure,  it  is  relatively  expensive  to  compute.  Here  the 
edge-border  coincidence  provides  us  with  a  cheap 
way  to  find  a  good  point  from  which  to  begin  the 
more  expensive  search  for  high  matching  confidence 
values.  Figure  11  shows  the  comparison  results  of 
the  two  schemes:  our  scheme  (scheme  1)  and  Peng 
and Bhanu’s scheme (scheme 2). Although good initial
estimates may not always result in faster discovery
of high matching confidence values, the edge-border
coincidence seems to work well in practice for
all the problems we have experimented with.
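
To illustrate why edge-border coincidence is cheap compared with model matching, the sketch below computes one commonly used form of the measure: the fraction of Sobel edge pixels that also lie on a region border of the segmentation. This definition, the helper names, and the use of a label image as the segmentation output are assumptions made for illustration; only the Sobel threshold of 200 comes from the experimental settings.

```python
import numpy as np


def sobel_edges(gray, threshold=200.0):
    """Boolean edge map: Sobel gradient magnitude above the threshold."""
    g = np.pad(gray.astype(float), 1, mode="edge")
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) \
        - (g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2])
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) \
        - (g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:])
    return np.hypot(gx, gy) > threshold


def region_borders(labels):
    """Boolean map of pixels whose 4-neighbor carries a different label."""
    b = np.zeros(labels.shape, dtype=bool)
    dh = labels[:, 1:] != labels[:, :-1]
    b[:, 1:] |= dh
    b[:, :-1] |= dh
    dv = labels[1:, :] != labels[:-1, :]
    b[1:, :] |= dv
    b[:-1, :] |= dv
    return b


def edge_border_coincidence(gray, labels, threshold=200.0):
    """Assumed definition: |E intersect S| / |E| (E edges, S region borders)."""
    edges = sobel_edges(gray, threshold)
    borders = region_borders(labels)
    n_edges = edges.sum()
    return float((edges & borders).sum() / n_edges) if n_edges else 0.0
```

Only pixel-wise operations are involved, so the measure can be evaluated on every candidate parameter set before any model matching is attempted.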




4  Conclusions  and  future  work 

We have presented a proof of principle of a general
approach for adaptive image segmentation and
object recognition. The approach combines a
domain-independent simple measure for segmentation
evaluation (edge-border coincidence) and
domain-dependent model matching confidence in a
reinforcement learning framework in a systematic manner
to  accomplish  robust  image  segmentation  and  ob¬ 
ject  recognition  simultaneously.  Experimental  re¬ 
sults  demonstrate  that  the  approach  is  suitable  for 
continuously  adapting  to  normal  changes  encoun¬ 
tered  in  real-world  applications. 

For adapting to the wide variety of images encountered
in real-world applications, we can develop an
autonomous gain control system which will allow the
matching between different classes of images taken
under significantly different weather conditions (sun,
cloud, snow, rain) and adapt the parameters within
each class of images. We use image context to divide
the input images into several classes based on
image properties and external conditions, such as
time of day, lighting conditions, etc. [9]. When
an image is presented, we use an image property
measurement module and the available external
information to find the stored information for this
category of images, and start the learning process from
that set of parameters. This will overcome the problem
of adapting to large variations between consecutive
images.

The  real  significance  of  using  a  learning  net¬ 
work  to  select  segmentation  parameters  to  optimize 
model  matching  performance  is  that  interconnec¬ 
tions  within  the  network  can  enforce  coordination  of 
the  choices  made  by  the  output  units  in  order  to  con¬ 
centrate  the  search  in  suspected  high-payoff  regions 
of  the  parameter  space.  A  network  that  can  coordi¬ 
nate  the  choices  made  by  the  output  units  should  be 
able  to  generate  certain  combinations  of  bits  with 
greater  probability  than  if  their  individual  compo¬ 
nents  were  selected  independently.  If  the  network 
operates  in  this  way  it  should  expect  to  find  high 
matching  confidence  values  much  more  quickly  than 
without  coordination.  We  plan  to  explore  these  is¬ 
sues  in  the  future. 
Abstract

The  new  ultra  wide  band  (UWB)  synthetic  aperture 
radar  (SAR)  exhibits  a  different  reflection  phenom¬ 
enology  on  metallic  surfaces  which  is  character¬ 
ized  by  a  resonant  response.  In  order  to  capture  the 
information  contained  in  the  resonant  response  we 
propose  to  project  the  down  range  profiles  in  a 
Laguerre  function  space,  which  is  an  orthogonal 
expansion  of  decaying  exponentials.  However,  the 
large  energy  contained  in  the  driven  response  (from 
both  metallic  and  non-metallic  objects)  would 
degrade  the  performance  of  CFAR  detectors  based 
on  spatial  information.  We  propose  to  sequentially 
combine  two  subspace  CFAR  detectors,  one  to 
detect  the  driven  response  and  the  other  to  detect 
the  resonance  response.  A  neural  network  is  uti¬ 
lized  to  fuse  the  two  responses.  Preliminary  results 
corroborate  the  adequacy  of  the  method. 

1.0  Introduction 

It  is  well  known  that  man-made  metallic  objects  in 
UWB  radar  produce  a  delayed  damped  sinusoidal 
response  following  the  large  energy  reflection 
called  the  driven  response.  The  driven  response  is 
also  generated  by  non-metallic  objects  such  as 
trees or foliage. CFAR algorithms based on the
generalized likelihood ratio test (GLRT) have been
proposed for detecting the target's resonance response in
UWB SAR images [Yen and Principe, 1997a] [Yen
and Principe, 1997b], but they utilize only either
the resonance response information or a combination
of the resonance response and the spatial
extent of the target. However, the large energy
contained in the driven response of non-metallic
objects like foliage degrades the performance of
these CFAR detectors.

In this paper we propose the strategy of temporally
combining two CFAR detectors [Wang and Chellapa
1994], one for detecting the early driven
response and the other for detecting the delayed
resonance response. The two detections are further
fused by a neural network. Although some algorithms
were proposed to “optimally” fuse several
statistically independent detections [Chair and
Varshney, 1996], they are optimal only for
fixed-threshold local detectors or under unrealistic
assumptions about data distributions. In this paper,
the previous fusion rule is extended using a
neural network, so that all the weights and thresholds
can be adapted from the training data to perform
better.

In Section 2, we discuss the temporal template
model for the target’s driven response and resonance
response. In Section 3, we formulate the fusion of
the two detectors’ results. In Section 4, the experiments
are presented.

2.0  The  Target’s  Temporal  Template  Model 
and  GLRT  implementation 

It is known that the ideal response $x$ of a metallic
object in a UWB SAR image can be temporally
decomposed into two components: the driven
response and the resonance response. The driven
response always precedes the resonance response,
which only contains damped sinusoids. The total
response $x$ can be divided in time as

$$x = \begin{bmatrix} x_d \\ x_r \end{bmatrix} \qquad (1)$$

where $x_d$ is the $N_d \times 1$ driven response and $x_r$
is the $N_r \times 1$ resonance response. Let $y$ be an $N \times 1$
column vector in the down-range profile of the
UWB SAR image. The measurement vector $y$ is
assumed to be formed by the total response $x$ and a
Gaussian noise vector $w$. Obviously,
$N = N_d + N_r$, where $N_d$ and $N_r$ are usually
target dependent. Now, $y$ can be written as

$$y = \begin{bmatrix} y_d \\ y_r \end{bmatrix} = \begin{bmatrix} x_d \\ x_r \end{bmatrix} + \begin{bmatrix} w_d \\ w_r \end{bmatrix} \qquad (2)$$

where $w_d$ and $w_r$ are the corresponding noise
vectors. It is important to note that for foliage or
non-metallic objects only the driven response $y_d$
exists. The energy contained in the driven response
is much larger than that contained in the resonance
response, which hinders the design of detectors
based on the resonant response.

Let us assume that the ideal resonance response $x_r$
belongs to a known $M_r$-dimensional orthogonal
signal subspace represented by an $N_r \times M_r$ matrix
$L_r$. If we apply the linear transform $L_r^T$ to $y_r$,
then we get

$$z_r = L_r^T y_r = S_r a_r + v_r \qquad (3)$$

where $a_r$ is the representation vector and $S_r$ is an
$M_r \times M_r$ matrix of rank $K_r$ describing the locations
of known components. At most one element
in each row and each column is equal to
one, and the remaining elements are zero.
$v_r = L_r^T w_r$ is the colored Gaussian noise vector
with zero mean and covariance $\sigma^2 Q_r$.

Following the approach in [Yen and Principe,
1997a], we can show that the GLRT test statistic
for this problem is given by

$$t_r = \frac{z_r^T Q_r^{-1} z_r - z_r^{\perp T} (Q_r^{\perp})^{-1} z_r^{\perp}}{\hat{\sigma}^2} \underset{H_0}{\overset{H_1}{\gtrless}} T_r \qquad (4)$$

where $T_r$ is the threshold for $t_r$ and $Q_r^{\perp}$ is the
correlation for the null space of the resonant response
(dimension $M_r - K_r$). In our UWB SAR scenario,
we assume $\sigma^2$ is unknown, and it can be estimated
from the neighboring $p \times 1$ sample vector $u$ by
$\hat{\sigma}^2 = (u^T u)/p$. It can be shown [Yen and
Principe, 1997a] that the test statistic $t_r$ is F-distributed,
and the CFAR property can be achieved.
Similarly, for the driven response, the GLRT test
statistic can be written

$$t_d = \frac{z_d^T Q_d^{-1} z_d - z_d^{\perp T} (Q_d^{\perp})^{-1} z_d^{\perp}}{\hat{\sigma}^2} \underset{H_0}{\overset{H_1}{\gtrless}} T_d \qquad (5)$$

where $T_d$ is the threshold.

The accuracy of the GLRT is highly dependent
upon the subspace utilized to represent the signal.
The most compact signal representation is obtained
when the bases are the eigenfunctions of the signal
we are dealing with. Because the resonant
response is a linear combination of decaying
sinusoids, we proposed to utilize the Laguerre
functions to implement the bases of the projection
space [Yen and Principe, 1997a]. The k-th
Laguerre sequence $l_k(n, \mu)$ is obtained by applying
the Gram-Schmidt orthogonalization to the following
exponential sequence [Mahammad and
Ahmed, 1991]

$$f_i(n, \mu) = n^i \mu^n \qquad (6)$$

The Laguerre sequences form a complete set
[Mahammad and Ahmed, 1991], and their Z transforms
display the following recursive relation

$$L_{k+1}(z, \mu) = L(z, \mu)\, L_k(z, \mu), \qquad L(z, \mu) = \frac{z^{-1} - \mu}{1 - \mu z^{-1}} \qquad (7)$$


An added advantage of the Laguerre functions is
that they are recursively computable, yielding very
compact and computationally efficient algorithms
(O(K), where K is the number of basis functions, which is
even better than the computational complexity of the FFT).
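
The recursion can be sketched as a cascade of first-order filters, as below. The all-pass section follows the recursive relation above; the first stage, a low-pass section with gain sqrt(1 - μ²), is the standard normalization for discrete Laguerre filters and is an assumption here, as are the function and variable names.

```python
import numpy as np


def laguerre_delay_line(x, n_taps, mu):
    """Return an (n_taps, len(x)) array of Laguerre tap outputs for input x.

    Tap 0 is a low-pass section (assumed normalization); every later tap is
    the previous tap passed through the all-pass (z^-1 - mu)/(1 - mu z^-1).
    """
    x = np.asarray(x, dtype=float)
    taps = np.zeros((n_taps, x.size))
    gain = np.sqrt(1.0 - mu ** 2)
    state = 0.0
    for n in range(x.size):
        state = mu * state + gain * x[n]
        taps[0, n] = state
    for k in range(1, n_taps):
        for n in range(x.size):
            out_prev = taps[k, n - 1] if n > 0 else 0.0
            in_prev = taps[k - 1, n - 1] if n > 0 else 0.0
            # y_k[n] = mu*y_k[n-1] + y_{k-1}[n-1] - mu*y_{k-1}[n]
            taps[k, n] = mu * out_prev + in_prev - mu * taps[k - 1, n]
    return taps
```

Each tap costs a constant number of operations per sample, which is where the O(K) cost per sample comes from. Feeding a unit impulse returns the Laguerre sequences themselves, one per row.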

The  implementation  of  our  detector  based  on  the 
GLRT  is  shown  in  Figure  1.  The  down  range  pro¬ 
file  is  sent  to  two  Laguerre  delay  lines  that  imple¬ 
ment  the  subspace.  Each  tap  output  is  a  projection 
on a basis. In order to implement (4) and (5), one
has to select which taps contain most of
the  information  about  the  resonant  response  (and 
the  driven  response).  These  taps  constitute  the  sig¬ 
nal  subspace  to  implement  the  GLRT.  The  remain¬ 
ing  taps  represent  the  null  space  for  the  signal. 

3.0  Temporal  Detection  Fusion  Using  a 
Neural  Network 

The sequential detection fusion can be viewed as a
two-hypothesis detection problem with two individual
detectors for the driven and resonance
responses, respectively. It can be shown [Chair and
Varshney, 1996] that the optimum decision rule to
fuse the detection results $u_1$ and $u_2$ is given by

$$f(u_1, u_2) = \mathrm{sign}(a_0 + a_1 u_1 + a_2 u_2) \qquad (8)$$

where

$$u_1 = \mathrm{sign}(t_d - T_d), \qquad u_2 = \mathrm{sign}(t_r - T_r) \qquad (9)$$

The optimum weights are given by

$$a_0 = \log(P_1 / P_0) \qquad (10)$$

$$a_i = \begin{cases} \log\dfrac{1 - P_{M_i}}{P_{F_i}} & \text{if } u_i = +1 \\[2ex] \log\dfrac{1 - P_{F_i}}{P_{M_i}} & \text{if } u_i = -1 \end{cases} \qquad (11)$$

where $P_{M_i}$ and $P_{F_i}$ are the miss detection probability
and false alarm probability for the $i$-th
detector, respectively. However, the above optimal
fusion rule implies fixed local detectors with preset
thresholds. Moreover, all the weights are pre-calculated
based on the theoretical signal distributions,
which is unrealistic for UWB SAR. The neural network
approach we propose extends the previous
fusion rule in (8) with the hyperbolic tangent function
instead of the hard-limit sign function, yielding

$$f'(u_1', u_2') = \tanh(a_0 + a_1 u_1' + a_2 u_2') \qquad (12)$$

where

$$u_1' = \tanh(t_d - T_d), \qquad u_2' = \tanh(t_r - T_r) \qquad (13)$$


The  advantage  is  that  all  the  weights  and  the 
thresholds  can  be  adapted  using  backpropagation 
[Haykin,  1994]  to  give  better  performance.  The 
overall  optimum  detection  scheme  can  be  imple¬ 
mented  as  shown  in  Fig.  1. 
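
As a rough illustration of the two fusion rules, the sketch below implements the hard-limit rule of (8)-(11) and the soft rule of (12)-(13). The function names, the argument layout, and the probabilities used in the usage note are hypothetical; in the soft rule the weights and thresholds are treated as free parameters that a training procedure such as backpropagation would adjust (the training loop itself is not shown).

```python
import math


def cv_weight(u, p_miss, p_fa):
    """Weight of one local decision u in {+1, -1}, following (11)."""
    if u > 0:
        return math.log((1.0 - p_miss) / p_fa)
    return math.log((1.0 - p_fa) / p_miss)


def hard_fusion(t_d, T_d, t_r, T_r, prior_ratio, pm_d, pf_d, pm_r, pf_r):
    """Hard-limit fusion (8): returns +1 (target) or -1 (no target)."""
    u1 = 1 if t_d >= T_d else -1
    u2 = 1 if t_r >= T_r else -1
    a0 = math.log(prior_ratio)  # log(P1 / P0), as in (10)
    s = a0 + cv_weight(u1, pm_d, pf_d) * u1 + cv_weight(u2, pm_r, pf_r) * u2
    return 1 if s >= 0 else -1


def soft_fusion(t_d, t_r, params):
    """Soft fusion (12)-(13); params = (a0, a1, a2, T_d, T_r) are trainable."""
    a0, a1, a2, T_d, T_r = params
    return math.tanh(a0 + a1 * math.tanh(t_d - T_d) + a2 * math.tanh(t_r - T_r))
```

With a driven-response detector that false-alarms often (for example on foliage) and a more selective resonance detector, a lone driven detection is outvoted in the hard rule, while a lone resonance detection is not.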


Figure  1.  GLRT  with  Laguerre  Network 

4.0  Simulation  results 

We  will  show  that  the  subspace  detection  scheme 
presented  in  [Yen  and  Principe,  1997a]  can  be 
largely  improved  using  the  sequential  detection. 
The  data  for  the  simulation  provided  by  ARL’s 
UWB  [McCorkle  and  Nguyen,  1994]  is  a  128x5376 
image.  That  image  contains  a  vehicle  target  around 
sample  3,000,  and  foliage  along  down  range  cell 
over  4,000.  We  use  a  10  dimensional  projection 
space  implemented  by  a  10th  order  Laguerre  delay 
line with the recursive parameter μ equal to 0.4.
Our previous results show that the resonance
response is within the subspace spanned by three
Laguerre kernels with order equal to 5, 6 and 7. The
remaining  taps  constitute  the  signal  null  space. 

The detection statistics of the usual GLRT based on
the 1D resonance model implemented by (4) or (5)
are shown in Fig 2 (first 3,600 down range cells) and
Fig 3 (remaining down range cells). Although the
parameter μ is not optimized and only three out of
the ten taps are used to define the signal subspace,
the algorithm is able to detect the targets around
sample 3,000 and provide a low output for the
clutter. However, there is also a false alarm around
4,000 with a response as high as the target.

Fig. 2 The detection statistics based on
the GLRT of the 1D resonance model
along the down range cells 1801~3600


Fig. 3 The detection statistics based on
the GLRT of the 1D resonance model
along the down range cells 3601~5376

The  detection  statistics  of  the  GLRT  based  on  the 
sequential  fusion  scheme  are  shown  in  Fig  4  and  Fig 
5,  respectively.  Comparing  the  detection  statistics 
of  target  in  Fig  3  with  that  of  the  foliage  in  Fig  5,  we 
can  see  that  the  foliage  detection  is  totally 
suppressed,  and  that  there  are  more  detections 
corresponding  to  the  accurate  target  locations. 


Fig.  4  the  detection  statistics  based  on 
the  fused  detector  along  the  down  range 
cells  1801-3600 


Fig.  5  the  detection  statistics  based  on 
the  fused  detector  along  the  down  range 
cells  3601-5376 


5.0 Conclusion

Although  these  are  preliminary  tests,  the  idea  of 
exploiting  the  structure  of  the  UWB  response  from 
metallic  objects  by  fusing  the  driven  response  with 
the  resonant  response  seems  to  improve  the 
accuracy of the focus of attention. In future
work, we will carefully design the projection
space (adapting the parameter μ) to extract most of
the resonance response, and make optimal
decisions for the search of the signal and noise
spaces. We are researching other kernels that match
even  better  the  characteristics  of  the  driven 
response.  One  aspect  that  we  would  like  to  mention 
is  the  simplicity  of  this  implementation  that  can 
lead  to  on-line  algorithms  for  focus  of  attention  in 
UWB. 
Abstract

We  present  and  illustrate  the  use  of  a  bottleneck 
system  for  the  segmentation  and  super-resolution 
of ISAR targets. The system comprises
three basic subsystems: a
compressing transformation, a bottleneck
processor, and a decompressing transformation.
We  describe  each  subsystem  and  discuss  the 
processing  responsible  for  segmentation  and 
super-resolution  within  this  framework.  Results 
using  this  network  are  assessed  and  issues 
regarding  performance  are  introduced. 

1.  Introduction 

Feature  extraction  is  critical  in  many  signal  and 
image  processing  applications  but  our  ability  to 
automatically  extract  features  from  data  is  very 
limited. In preselecting features, we rely too
much and too often on our a priori knowledge of
the problem. This methodology can be
problematic when such a priori knowledge is
scarce and also hinders our ability to quantify the
quality  of  the  features  chosen.  In  our  opinion, 
feature  extraction  should  be  based  on  self¬ 
organizing  methods  because  the  signal’s  samples 
are  the  only  available  information  source.  Such  a 
methodology  is  encountered  in  prediction  where 
the  input  signal  is  cleverly  utilized  as  a  desired 
response. 

Prediction  is  a  way  of  self-organizing  a  system  for 
time  signals  but  it  is  much  more  difficult  to  apply 
for  images.  The  other  known  principle  of  self¬ 
organization  with  an  implicit  desired  response  is 
auto-association  with  a  bottleneck  layer  [Bourland 
and  Kamp,  1988].  This  can  be  thought  of  as  the 
equivalent  to  prediction  for  images  and  also  gives 
us  a  model  for  the  intrinsic  structure  of  the  data. 


In  essence,  we  seek  to  model  image  data  for  a 
given  class  of  imagery  that  is  “independent”  of  its 
scale.  Such  an  approach  has  been  proposed  in 
[Candocia  and  Principe,  1997]. 

The  idea  of  bottleneck  processing  is  not  new. 
This  type  of  processing  has  had  much  success  in 
the  areas  of  image  compression  [Jain,  1989]  and 
subspace  pattern  recognition  [Oja,  1983].  In 
image  compression,  the  saving  of  a  few  transform 
coefficients  to  represent  data  in  a  compressed 
form  can  be  formulated  as  a  bottleneck  process. 
In  subspace  pattern  recognition,  the  reduction  in 
dimensionality  of  a  signal  is  an  important  practical 
step  to  obtaining  discriminant  functions.  This 
type  of  processing,  though,  is  not  restricted  to 
auto-association.  It  has  seen  use  in  hetero¬ 
association  via  non-symmetric  PCA  [Kung,  1993]. 
The  work  presented  here  makes  use  of  such 
processing  in  a  more  general,  non-traditional 
fashion. 

2.  The  Bottleneck  System 

The  bottleneck  system  (BNS)  as  an  auto- 
associator  is  composed  of  three  basic  components: 
(1)  a  compressing  transformation  which  is 
responsible  for  producing  a  reduced  representation 
of  the  input  signal,  (2)  a  bottleneck  processor 
which  further  processes  the  compressed  signal 
space  and  (3)  a  decompressing  transformation 
which  is  responsible  for  reconstructing  the  input 
signal.  This  is  illustrated  in  fig.  1. 

The input to the BNS is given by the vector x and
the reconstructed output is denoted $\hat{x}$. The first
and last blocks of this processing are the
projections that constitute the forward and inverse
transforms of a signal, respectively. These
transforms need not be linear. The only constraint
is that the dimensionality of the input space be
reduced. The compressed input space is denoted y
and  this  feeds  into  the  bottleneck  processor 
(BNP).  The  BNP  generates  information  z 
regarding  y  which  could  aid  in  the  reconstruction 
of  x;  it  could  also  receive  additional  information  q 
as  input  to  aid  in  the  generation  of  z.  The 
information  z  and  compressed  input  space  y  then 
feed  into  the  decompressor. 


Figure 1. Block diagram of the BNS.

Commonly used compression techniques make use
of linear transforms in the first and last blocks of
the BNS, and the BNP is simply an identity
transformer f(y) = y with no q or z. In this case a
vector $x \in \mathbb{R}^N$ is transformed to a vector $y \in \mathbb{R}^M$ via
$y = Wx$, where M < N. The approximate
reconstruction of vector x is given by $\hat{x} = W^H y$,
where H denotes the Hermitian of W. In PCA, it
is known that W is a matrix whose rows are the M
largest eigen-valued eigenvectors of x’s
covariance matrix and y is a vector of the principal
components of x. In transforms such as the DCT
or DFT, the rows of W constitute sinusoidal basis
functions (real and complex, respectively) onto
which x is projected, and y are the
corresponding transform coefficients. The basis
functions used for projecting are the ones that
yield the largest transform coefficients.

3. Pre-processing the Input

This paper addresses the ability of a BNS to (1)
segment target vs. background and (2) super-resolve
a 1m x 1m resolution ISAR data set to 1ft
x 1ft resolution. These two problems are very
different. The first is a clustering (classification)
problem and the second is a regression problem.
To aid in tackling these problems with the BNS, it
is important to consider what pre-processing of
the input, if any, should be performed. The
pre-processing step should reflect any characteristic of
the data inherent to alleviating the complexity
associated with the problem. Let us now note that
any reference to an ISAR image is referring to the
PWF transformed ISAR target. An ISAR image is
thus real valued. The 1ft x 1ft resolution training
and test images are illustrated in fig. 2.

Figure 2. High resolution ISAR images, (left 8)
training (right 8) testing.

It is evident that segmenting a target versus
background in an ISAR image requires
information about the brightness of a pixel and
eventually texture. Here we will work simply
with brightness. This brightness is directly
proportional to the amount of backscatter received
by the radar - which is usually large for metallic
objects relative to non-metallic ones. As such, the
segmentation problem is local in nature and
should make use of a local brightness measure.
This is done by transforming local neighborhoods
of our ISAR images into spherical coordinates.
More specifically, each H x H neighborhood of
our ISAR images is regarded as a vector (in
Cartesian coordinates) in an $H^2$ dimensional
vector space. Our set of vectors, or ISAR image
neighborhoods, are then transformed to their
multi-dimensional spherical coordinates. One of
the coordinates in this representation is known to
be the length or norm of the vector. This quantity
is descriptive of the brightness of an ISAR image
neighborhood. This is the preprocessing
performed with regards to our segmentation
problem.


The  preprocessing  for  super-resolving  an  ISAR 
image  is  different  from  that  just  described.  The 
backscatter  at  various  points  on  targets  can  be 
quite  similar  -  even  across  a  set  of  different  targets 
situated  at  varying  aspect  angles  relative  to  the 
radar.  This  also  suggests  a  local  approach  to  the 
super-resolution  problem.  What  is  not  clear  at 
present  is  which  set  of  descriptors  (and  for  that 
matter,  pre-processor)  retains  the  most 
information  about  a  class  of  images  across 
resolutions.  In  our  pre-processing,  we  have 
decided  to  normalize  each  vector  (ISAR  image 
neighborhood)  to  unit  length.  The  effect  of  this 
operation  will  be  discussed  in  the  next  section. 
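
The two pre-processing steps described in this section can be sketched as follows. The sketch assumes overlapping H x H neighborhoods taken at every pixel offset and vectorized by stacking columns; the function names and the small `eps` guard against all-zero neighborhoods are illustrative assumptions.

```python
import numpy as np


def neighborhood_vectors(image, H):
    """Stack the columns of every H x H neighborhood into a row vector."""
    rows, cols = image.shape
    vecs = [image[r:r + H, c:c + H].flatten(order="F")
            for r in range(rows - H + 1)
            for c in range(cols - H + 1)]
    return np.array(vecs, dtype=float)


def brightness(vecs):
    """Segmentation feature: the norm (radial spherical coordinate)."""
    return np.linalg.norm(vecs, axis=1)


def unit_normalize(vecs, eps=1e-12):
    """Super-resolution pre-processing: scale each vector to unit length."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, eps)
```

For segmentation only `brightness` is needed; for super-resolution the unit-length vectors are kept, and the norms are stored so that the length can be restored after reconstruction.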
4. Defining the Blocks of the BNS

The  framework  for  the  processing  of  ISAR  images 
is  given  by  the  BNS.  Here  we  motivate  and  define 
the  processes  contained  within  each  of  the  blocks 
pictured  in  fig.  1 . 

4.1  The  Compressing  Transformation 

The  process  of  super-resolving  ISAR/SAR 
information  involves  the  increase  of  its  resolution. 
This  process  is  akin  to  that  of  interpolation  in 
images.  Here  we  synthesize  the  lower  resolution 
1m x 1m ISAR set to be super-resolved by
decimating  our  original  1ft  x  1ft  ISAR  images  by 
a  factor  of  3  x  3.  These  images  are  illustrated  in 
fig.  3.  Note  that  now  we  have  two  versions  of  the 
same  imagery  with  different  resolutions  to  train  a 
model.  Later  on,  the  model  can  be  used  on  new 
low  resolution  data  to  enhance  it. 

Figure 3. Synthesized low resolution ISAR
images, (left 8) training (right 8) testing.

Decimation is the process of appropriate lowpass
filtering followed by subsampling [Crochiere and
Rabiner, 1981]. Notice that this is a
non-invertible, non-linear transformation which yields
a coarse representation of our input images.
Neighborhoods are then extracted from the
decimated images. The set of these
neighborhoods will be denoted by $y = \{y_p\}$ where
each $y_p$ is a distinct H x H neighborhood from the
compressed training images that has been
converted to a vector by stacking the columns of
the square neighborhood (or matrix) one on top of
the other. These neighborhoods are samples of
the compressed input space y alluded to in fig. 1.
Our compressing transformation is thus a
decimation by a factor of 3 x 3 followed by an
extraction of neighborhoods from the resulting
images. This compressing transformation also
works well for the segmentation problem. It
further reduces speckle in our coarser ISAR
images due to the decimation (albeit at the
expense of image detail). However, the detail is
not significant to this segmentation problem.
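
A minimal sketch of a 3 x 3 decimation is given below. It assumes a box (block-average) filter as the lowpass stage, which is only one possible choice of "appropriate" filter, and it crops any rows or columns that do not fill a complete block; the function name is hypothetical.

```python
import numpy as np


def decimate(image, factor=3):
    """Box-filter lowpass followed by subsampling, done as a block average."""
    rows = (image.shape[0] // factor) * factor
    cols = (image.shape[1] // factor) * factor
    blocks = image[:rows, :cols].astype(float).reshape(
        rows // factor, factor, cols // factor, factor)
    # Averaging each factor x factor block filters and subsamples in one step.
    return blocks.mean(axis=(1, 3))
```

Averaging nine pixels per output sample is also what reduces speckle in the coarser image. Neighborhoods of the decimated image can then be extracted and vectorized column by column, as described in the text.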

4.2  The  Bottleneck  Processor 

It is important to be able to extract features (in a
self organizing manner) from the compressed and
pre-processed input space y. These features serve
to establish the M most relevant descriptors of this
space and are subsequently used to partition it.
Here, the feature extraction is accomplished via
vector quantization (VQ) of the neighborhoods $y_p$ of
the compressed and pre-processed training images
in set y. A number of VQ algorithms exist
including k-means, Kohonen’s self organizing
feature map [Kohonen, 1990] and the neural gas
algorithm [Martinez et al., 1993]. The codebook
vectors or quantization nodes $q_i$, $i = 1, \ldots, M$, that
result from VQ are the intrinsic descriptors of y.
We denote the set of quantization nodes by q, i.e.
$q = \{q_1, \ldots, q_M\}$ and each $q_i \in \mathbb{R}^K$ where $K = H^2$.

Our study makes use of the BNP illustrated in fig.
4. There are two separate inputs to this block: q
and y as previously discussed. Clustering
neighborhoods $y_p$ based on closest distance to
each $q_i$ results in a hard partitioning of y into
regions that are most correlated. Specifically, the
cluster $C_i$ contains those neighborhoods $y_p$ of y
that are closest to $q_i$ in Euclidean distance. This is
given in eqn. (1).

$$C_i = \{\, y_p : \|y_p - q_i\|_2 < \|y_p - q_j\|_2 \,\} \qquad (1)$$

where $i = 1, \ldots, M$ and $i \neq j$. The single integer
output $z \in \{1, \ldots, M\}$ of the BNP represents the
cluster $C_z$ that a neighborhood $y_p$ belongs to.

Figure 4. The BNP implemented for this paper.

The  segmentation  problem  mentioned  needs  only 
M=2  quantization  nodes.  One  node  theoretically 
clusters  neighborhoods  corresponding  to  targets 
and  the  other  corresponds  to  non-target 
neighborhoods  (no  shadows  are  considered). 
The super-resolution approach makes use of M=30
quantization nodes, i.e. the neighborhoods $y_p$ are
partitioned into 30 clusters which will each be
super-resolved. Note that (the vector for) each $y_p$
clustered is of unit length due to the
pre-processing that was performed. This form of
pre-processing yields scale invariant neighborhoods,
i.e. for two neighborhoods a and b, if a ≈ kb (k a
scalar), these two neighborhoods have the same
underlying reflectance properties, regardless of
the illumination in the scene. This is an
assumption made in homomorphic image
processing [Dony and Haykin, 1995] which may
not be valid for our ISAR images (as alluded to
earlier). The question as to what pre-processor to
use for super-resolution needs further addressing.

4.3  The  Decompressing  Transformation 


The  decompressing  transformation  is  not  needed 
for  the  segmentation  problem;  the  output  z  of  the 
BNP  describes  the  cluster  a  neighborhood 
corresponds  to.  Each  neighborhood  in  the  ISAR 
image  is  thus  assigned  to  the  target  or  non-target 
cluster. 


The decompressing transformation is obviously needed for the super-resolving of ISAR images. The neighborhoods y_p have been clustered into M=30 groups. Our decompressing transformation is composed of M=30 individual affine transformations {W_z, B_z}, each tailored to the specific information contained in cluster C_z, z = 1,...,M. z now is essentially a “pointer” or indicator as to which individual transformation to use. The reconstruction of a neighborhood x_p is accomplished by:


■uvei 


+  (2) 


where W_z and B_z are the weight matrix and bias vector associated with an affine transformation, unvec() undoes the vectorizing operation that was performed on the neighborhoods y_p in set y, and the 2-norm of the un-normalized neighborhood is used to restore the length of the vector, which was removed during pre-processing. Details concerning the individual transformations are discussed in [Candocia and Principe, 1997].

5.  Results 

The results presented here utilized ISAR targets that were PWF transformed from the TABILS 24 data set. The resolution of this data set is 1ft x 1ft. We chose 8 targets for each of our training and test sets spanning 180° of aspect angles. The difference between target aspect angles in each set was 22.5°. The corresponding 1m x 1m low resolution training/test data was simulated through decimation of the high resolution training/test data as discussed earlier. Neighborhoods of 5 x 5 (H=5) were utilized in the extraction of features both for the segmentation and super-resolution examples. The features automatically found through clustering the low resolution neighborhoods for the purpose of target segmentation are illustrated in fig. 5.

* a ≈ kb is our shorthand notation for ||a - kb||_2 < ε, where ε > 0 is small and a, b are vectors.


Figure  5.  Features  extracted  for  target 
segmentation,  (left)  target,  (right)  non-target. 

These features have been scaled to visually enhance the structure associated with each feature. The number in parentheses indicates the 8-bit gray level difference between the brightest and darkest value in each feature. Notice that the extracted feature corresponding to targets has a peaky center. This is consistent with the notion that ISAR targets are characterized by bright point scatterers. The non-target feature is, interestingly enough, an “anti-target” feature. It characterizes local information that is “opposite” that of target information. Fig. 6 illustrates the target vs. non-target segmentation results.


Figure 6. Segmented low resolution images: (left 8) training, (right 8) test.

It is important to note that pre-processing is critical to the types of features extracted with the self-organized clustering scheme. The capacity of the network to interpolate or super-resolve the low resolution representations to the resolution of the original images is illustrated in fig. 7.


Figure  7.  Super-resolved  images,  (left  8)  training 
(right  8)  test. 

The  pre-processing  of  the  low  resolution  images 
consisted  of  a  simple  normalization  of  the  image 
neighborhoods.  Here,  M=30  features  were 
extracted  from  the  pre-processed  low  resolution 
images.  Note  that  we  are  attempting  to  establish  a 
system  that  recovers  9  times  the  information 
presented  to  it.  The  cropping  effect  is  due  to  not 
super-resolving  image  portions  with  low 
“confidence”.  This  confidence  is  directly 

attributed  to  the  amount  of  available  data  about 
the  location  to  super-resolve. 

Comparing the test portion of fig. 3 with fig. 7 shows that the BNS approach is capable of capturing the salient characteristics of a class of images across resolutions.

6.  Discussion  and  Conclusions 

The  self-organization  approach  solved  the 
segmentation  problem  here  with  little  effort.  In 
order  to  segment,  it  was  important  to  have  a  local 
measure  of  brightness  available  to  the  clustering. 
In  fact,  the  coordinate  largely  responsible  for  the 
segmenting  was  the  length  coordinate  in  the 
spherical  coordinate  transformation  utilized.  By 
clustering  on  this  sole  coordinate,  comparable 
results  to  those  of  fig.  6  were  obtained. 

The bottleneck approach extracts and models the image structure across resolutions in the image set. The derived model can then be applied to new low resolution images to super-resolve them. For instance, a 1m x 1m PWF radar image of targets could be digitally interpolated on the fly to 1ft x 1ft using our method.

We  are  still  investigating  many  issues  concerning 
the  ISAR/SAR  super-resolution  problem.  Our 
research  on  optical  images  has  shown  the 
existence  of  highly  correlated  information  across 
scales  and  that  this  information  can  be  exploited 
for  interpolation.  There  is  an  analogous  relation 
to  this  in  the  electro-magnetic  domain  of  SAR 
which  we  also  wish  to  exploit.  Very  probably, 
the  super-resolution  should  be  performed  both  at 
the  complex  and  PWF  transformed  image  levels. 
Also, questions as to hard vs. soft partitioning of the low resolution space are being examined, as well as what pre-processing is “most appropriate” for the super-resolution problem.

Abstract

We present a distance weighted geometric hashing-based indexing technique and a matching technique using body models and turret models for the automatic target recognition of articulated objects in synthetic aperture radar images. For each target, 360 body models and 360 turret models are built. These models are independent of the relative position between the body and turret. Four non-articulated targets (SCUD missile launcher, T-72 tank, M1 tank and T-80 tank) are used in the indexing stage to build the look-up table. In the matching stage, M1 tanks with the turret rotated 30°, 60° and 90° relative to the body are used as data.


1  Introduction 

Recognition  of  articulated  objects  in  SAR  images  is 
a  challenging  problem.  A  simple  approach  may  con¬ 
sider  each  of  the  articulated  parts  of  an  object  as  sep¬ 
arate  objects.  However,  such  an  approach  is  quite 
inefficient  since  it  will  require  a  large  model  database. 
We  want  to  develop  an  efficient  recognition  approach 
that  inherently  models  the  articulated  nature  of  an  ob¬ 
ject  such  as  a  SCUD  missile  launcher  or  a  tank  with 
different  positions  of  its  turret. 

Some of the representative work for target recognition using SAR images includes [2], [4] and [5]. These papers focus on template matching techniques in which the templates are manually designed. Recent work on the recognition of articulated objects in SAR images includes [1], [3]. In this paper we describe a geometric hashing-based indexing with weighted voting and a matching technique using body models and turret models. We have evaluated the performance of our initial approach using XPATCH data.


*This work is supported by grant MDA972-93-1-0010. The contents and information do not reflect the position or policy of the U.S. Government.




Figure  1:  System  for  recognizing  articulated  targets 
in  SAR  images. 


1.1  Approach 

Figure 1 shows the system for recognizing articulated targets in SAR images. Our approach is based on local features and a local reference coordinate system. The models for the look-up table are constructed by extracting the relative positions of the features from the non-articulated training data. The body and turret models are constructed by using three different articulation configurations of tank targets.

A detailed description of the geometric hashing technique, specifically designed for SAR, using the look-up table is given in [3]. Distance weighted geometric hashing-based indexing is an enhancement of the basic geometric hashing technique which increments each vote not by one, but by max(|dx|, |dy|), where |dx| and |dy| represent the absolute values of the relative distances between two peak features in the range and cross-range directions, respectively. This module generates a set of hypotheses with target identification and pose, which may be the body pose or the turret pose.

Figure 2: Indexing and matching components for recognizing articulated targets.

For the matching process, 360 body models and 360 turret models are built (one per degree of azimuth) for each tank target and used to find the positive and negative point features from the data (see Figure 2). The positive features are used to generate hypotheses for target identification and body pose. Negative features are used to generate hypotheses for turret pose. As there might be multiple hypotheses from the indexing module, the matching module loops over each of the hypotheses.

The basic assumption is that the positive points are from the body part. The negative features are produced as a result of articulation and interaction between the body and the turret. For some targets like the M1 tank, the turret part is so large that the indexed pose may be the turret pose. To resolve this problem, the Positive-Negative Feature Analysis stage uses body models and turret models to detect the part. This stage uses the body model with specific target type (ID) and body pose to detect the positive features. If the number of positive points is larger than a fraction of the number of the specific body model points, then we generate the turret hypothesis. If the number of positive points is less than a fraction of the number of the body model points, we use the turret model for the target ID and turret pose and go to the next step to generate the body hypothesis.


2  Off-line  Model  Building 

2.1  Extraction  of  scattering  centers  to 
build  non-articulated  model  base 

We  employed  a  simple  method  of  detecting  local 
maxima.  The  method  is  based  on  comparing  the  pixel 
value  with  its  immediate  eight  neighbors.  If  the  cur¬ 
rent  pixel  value  is  greater  than  all  the  other  immediate 
eight  neighbors,  then  it  is  a  local  maximum. 
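
The eight-neighbor test described above can be sketched as follows. This is an illustrative fragment, not the authors' code: the image is assumed to be a list of rows of numbers, border pixels are skipped (the text does not say how borders are handled), and the name `extract_local_maxima` is borrowed from the pseudocode in this paper with an assumed signature.

```python
def extract_local_maxima(image, n):
    """Return up to n peaks as (value, row, col), strongest first.

    A pixel is a local maximum when its value is strictly greater than
    all eight immediate neighbors. Border pixels are not considered.
    """
    rows, cols = len(image), len(image[0])
    peaks = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            v = image[r][c]
            neighbors = [image[r + dr][c + dc]
                         for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                         if (dr, dc) != (0, 0)]
            if all(v > nb for nb in neighbors):
                peaks.append((v, r, c))
    peaks.sort(reverse=True)  # descending order of magnitude
    return peaks[:n]
```

Sorting by value before truncating to `n` matches the step of keeping the strongest scattering centers in descending order of return magnitude.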

2.2  Building  non-articulated  model  base 

We extracted the top fifty local maxima from the images of the SCUD missile launcher with missile down, and of the T72 tank, M1 tank and T80 tank with turrets aligned straight with the bodies. The top fifty local maxima are then sorted in descending order of the magnitudes of their SAR return signals.

build_non_articulated_model_base()
{
    N = number of local maxima
    for (Object = 1 to NO_OF_OBJECTS) {
        for (Angle = 0 to 359) {
            model = get_model_image(Object, Angle)
            peaks = extract_local_maxima(model, N)
            save(Object, Angle, peaks)
        }
    }
}


2.3  Building  body  models 

Figure 3 shows T-72 tanks with the turret rotated 0°, 60°, and 90° relative to the body, whose pose is 283°. The fourth panel shows the body model of the T-72 tank with the body azimuth at 283°. To build this model, a conjunction operation is first performed between each pair of images with different turret poses. This operation generates three sets of point features (0° & 60°, 0° & 90°, 60° & 90°). Then, a union operation on these three sets of point features gives a set of point features which represents the body model of the T-72 at 283°. The conjunction operation represents the best matching between two sets of point features. The union operation represents the union of two sets of point features where one set of point features is translated appropriately to have the best matching between them. The result in Figure 3(d) shows 23 point features. There are 17 large dots in the model, which represent the point features for the best matching among the three sets of original point features.

Figure 3: T-72 body model with azimuth 283°. (a) body azimuth = 283°, turret straight. (b) body azimuth = 283°, turret 60°. (c) body azimuth = 283°, turret 90°. (d) body model at 283°.

Figure 4: M-1 turret model with azimuth 105°. (a) turret pose = 105°, body azimuth = 105°. (b) turret pose = 105°, body azimuth = 45°. (c) turret pose = 105°, body azimuth = 15°. (d) turret model at 105°.

2.4  Building  turret  models 

Figure 4 shows M-1 tanks with the turret rotated 0°, 60°, and 90° relative to the body, whose poses are 105°, 45°, and 15°, respectively. The fourth panel shows the turret model of the M-1 tank with the turret azimuth at 105°. To build this model, a conjunction operation is first performed between each pair of images with different turret poses. This operation generates three sets of point features. Then, a union operation on these three sets of point features gives a set of point features which represents the turret model of the M-1 at 105°. The result in Figure 4(d) shows 15 point features. There are 7 large dots in the model, which represent the point features for the best matching among the three sets of original point features.


3  On-line  Indexing  and  Matching 

3.1  Distance  Weighted  Geometric 
Hashing-Based  Indexing 

The paper by Jones & Bhanu [3] describes the geometric hashing technique in detail. We have enhanced the indexing module by incrementing the variable vote by max(|dx|, |dy|) instead of incrementing it by one. This weighted voting scheme differs from the original non-weighted voting scheme in employing the relative distance between two points as the weighting factor. This approach improves the indexing results, as shown in Figure 5.
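
The effect of the weighting can be illustrated with a deliberately simplified sketch. It is not the indexing module of [3]: here the look-up table is assumed to be keyed by the integer (dx, dy) offset between ordered pairs of peaks, each entry stores only a model label, and votes are accumulated per label. The names `build_table` and `index` and the data layout are hypothetical.

```python
from collections import defaultdict

def build_table(models):
    """models: dict mapping a label to a list of (range, cross_range) peaks."""
    table = defaultdict(set)
    for label, pts in models.items():
        for (x0, y0) in pts:
            for (x1, y1) in pts:
                if (x0, y0) != (x1, y1):
                    table[(x1 - x0, y1 - y0)].add(label)
    return table

def index(data_pts, table, weighted=True):
    """Vote for model labels; return (label, votes) sorted by votes."""
    votes = defaultdict(int)
    for (x0, y0) in data_pts:
        for (x1, y1) in data_pts:
            dx, dy = x1 - x0, y1 - y0
            if (dx, dy) == (0, 0):
                continue
            # Weighted scheme: increment by max(|dx|, |dy|) instead of 1.
            w = max(abs(dx), abs(dy)) if weighted else 1
            for label in table.get((dx, dy), ()):
                votes[label] += w
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
```

In this sketch, widely separated peak pairs contribute more to a hypothesis than closely spaced ones, which is the only change relative to unit voting.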

3.2  Matching 

The following algorithm generates a set of hypotheses and finds the best correspondence between the data and the set of hypotheses.

Exact_Matching(dataimage)
{
    data = extract_local_maxima(dataimage, N)
    candidates = Weighted_Geometric_Hashing(data)
    for (each model in top K candidates) {
        Positive_Negative(data_points, model_points)
    }
    Sort the hypotheses in descending order of the positive points
    Select model with the most matching points
}

Figure 5: Comparison of the two indexing schemes, the max(|dx|, |dy|) weighting scheme and the no-weighting scheme, for the top hypotheses.


Given  two  sets  of  points,  transform  the  first  set  to 
find  the  maximum  number  of  corresponding  points. 
In  this  transform,  only  translation  is  considered  be¬ 
cause  the  rotation  and  scaling  are  taken  care  of  by 
the  design  of  the  recognition  system  and  the  peculiar 
characteristics  of  the  SAR  sensor. 

The algorithm given in Figure 6 shows how to find the positive and negative points. The best case time complexity is O(MD) and the worst case time complexity is O(MD^2), where M and D represent the number of model and data point features, respectively.


4  Experiment  Results 

4.1  Building  Models 

In model building, we use four non-articulated objects: the SCUD missile launcher with missile down, and the T72 tank, M1 tank and T80 tank with the turret aligned straight with the body. For each non-articulated object, we generate 360 images (one for each degree in azimuth) for a given depression angle of 15°. From each image, we extracted the top 50 scattering centers with their signal returns and locations as point features of the model. So, the total number of models in the model database is 1440 (4 non-articulated objects * 360).

4.2  Generating  testing  data 

For  the  testing  data,  we  used  M-1  tanks  with  three 
articulated  turret  positions,  30°,  60°  and  90°  rotated 
relative  to  the  tank  body.  For  each  articulated  posi¬ 
tion,  we  generated  360  images  (  one  for  each  degree 
in  azimuth)  for  a  given  depression  angle  of  15°.  From 
each  image,  we  extracted  the  top  50  scattering  centers 
with  their  signal  returns  as  point  features  of  the  data. 
So,  the  total  number  of  data  in  the  experiment  is  1080 
(3  articulated  objects  *  360). 


GIVEN: • 2D model points and Model Coordinate System (MCS)
       • 2D data points and Data Coordinate System (DCS)

FIND:  • maxcount: the maximum number of correspondences
         between model and data points.
       • correspond: the maximum set whose elements are pairs
         of locations for the correspondence between model and data.
       • positives: the data points in correspond
       • negatives: the data points which are not in correspond

procedure Positive_Negative(model_points, data_points)

    place model points on 2D array, A, using MCS
    initialize maxcount to 0, correspond to empty set

    for (each model point M)
        for (each data point D)
            initialize count to 0, corr to empty set
            compute the offset between M and D
            for (each data point D1)
                apply the offset to D1 (convert it from DCS to MCS)
                if (there is a model point M1 at the same location in A)
                    increment count and append (D1, M1) to corr
            if (count > maxcount)
                update maxcount with count and
                update correspond with corr
        if (maxcount >= the number of remaining M) return


Figure  6:  Algorithm  for  the  Positive-Negative  Feature 
Analysis. 
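
A runnable sketch of the procedure in Figure 6 is given below. It follows the pseudocode under a few assumptions that are not stated in the paper: points are integer (x, y) tuples, a Python set stands in for the 2D array A, and the early-exit test is evaluated once per model point. The function name `positive_negative` is illustrative.

```python
def positive_negative(model_points, data_points):
    """Translation-only matching of two 2D point sets.

    Returns (maxcount, correspond, positives, negatives), where
    correspond is a list of (data_point, model_point) pairs.
    """
    model_set = set(model_points)  # stands in for the 2D array A
    maxcount, correspond = 0, []
    for i, (mx, my) in enumerate(model_points):
        for (dx, dy) in data_points:
            ox, oy = mx - dx, my - dy  # offset that maps D onto M
            corr = []
            for (px, py) in data_points:
                shifted = (px + ox, py + oy)  # D1 expressed in the MCS
                if shifted in model_set:
                    corr.append(((px, py), shifted))
            if len(corr) > maxcount:
                maxcount, correspond = len(corr), corr
        # Stop once no better match can involve only the remaining model points.
        if maxcount >= len(model_points) - (i + 1):
            break
    positives = [d for d, _ in correspond]
    negatives = [d for d in data_points if d not in positives]
    return maxcount, correspond, positives, negatives
```

The three nested loops give the O(MD^2) worst case quoted in the text; the early exit is what allows the procedure to finish sooner when a large match is found among the first model points.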

4.3  Example  of  positives  from  turret 

Figure 7 shows an example of the positive points in the data compared to the non-articulated model which is recognized by the turret pose instead of the body azimuth. Note that the positive features (Figure 7(c)) match better with the turret model (Figure 7(e)) than with the body model (Figure 7(d)).

4.4  Discussion 

Figure 8 shows the results for indexing. The enhanced indexing with weight max(|dx|, |dy|) curve shows the cumulative percentage of the correct target ID and body pose up to the 40th position in the list of hypotheses sorted in descending order of the vote. The cumulative percentage accuracy up to the 40th hypothesis is 92.87%.

The  enhanced  indexing  with  positive-negative  curve 
shows  the  cumulative  percentage  of  the  correct  target 
ID  and  body  pose  up  to  the  40th  position  in  the  list  of 
new  hypotheses  sorted  in  descending  order  of  the  num¬ 
ber  of  positive  features  from  the  positive-negative 
feature  analysis  without  the  correction  of  the  confu¬ 
sion  between  body  and  turret.  This  curve  shows  that 
the  positive-negative  feature  analysis  brings  the 
correct  answer  to  the  top  of  the  hypotheses  list  if  the 
correct  answer  is  among  the  top  40  hypotheses  of  the 
indexing  result. 

The body detection by positive-negative feature analysis curve shows the cumulative percentage of the correct target ID and body pose after the correction of the confusion between body and turret using the body models and turret models. Based on the top 40 answers, the correct target ID and body pose increased to 97.59%, which increases the correct target ID and body pose by 4.74%.

Figure 7: Example of positives from the turret of the M-1 tank. (a) Data: body azimuth 301° and turret pose 31°. (b) Non-articulated model hypothesis generated by indexing: body azimuth 31° and turret pose 31°. (c) The positive features of the data detected by the non-articulated model. (d) M-1 body model: azimuth 31°. (e) M-1 turret model: pose 31°.

Figure 8: Results for Indexing

Figure 9 shows the results based on the top hypothesis. In the ID, Body and Turret columns, 0, 1 and X represent incorrect, correct and don’t care, respectively. The column Exact shows the percentage of testing data recognized at the exact pose. The column Within ±5° shows the percentage of testing data recognized within ±5° of the correct pose.

5  Conclusions  and  Future  Work 

In this paper, we have presented the initial research for matching. The goal is to develop physically-based approaches having multiple representations (variety of feature types) for matching to recognize articulated targets in SAR images.

ID    Body    Turret    Exact (%)    Within ±5° (%)
0     X       X           0.56          0.56
1     0       0           1.94          1.39
1     0       1           0.37          0.19
1     1       0          46.20         21.02
1     1       1          50.93         72.69

                  ID (%)    Body (%)    Turret (%)
Exact             99.44     97.13       51.30
Within ±5°        99.44     98.98       72.87

Figure 9: Recognition results. These results are based on the top hypothesis only.

We are developing an integrated matching technique and analysis for verifying a hypothesis (target ID, body pose, turret pose) using articulation variants, positive/negative features, XPATCH prediction, surface reflector type and relative geometry of parts of articulated objects (e.g. M-1 / T-72). We are investigating a Bayesian probabilistic approach to combine the above known information in an integrated manner.

Abstract


The  techniques  of  Lie  group  analysis  can  be  used 
to  determine  absolute  invariant  functions  which 
serve  as  classifier  functions  in  object  recognition 
problems.  Lie  group  analysis  is  a  powerful  tool 
for  analyzing  complex  systems  such  as  the  conser¬ 
vation  model  used  in  recent  thermophysical  invari¬ 
ance  (TPI)  research.  We  will  discuss  the  mathe¬ 
matics  of  Lie  groups  and  the  application  to  recog¬ 
nition  problems  (TPI  specifically).  The  experi¬ 
mental  results  will  demonstrate  the  validity  of  the 
methods  and  determine  the  direction  of  future  re¬ 
search.  More  extensive  background  and  results  are 
available  in  an  extended  version  of  this  paper. 

1  Introduction 

In a nutshell, here is what these techniques provide and how they can be used in classifying objects: Lie group analysis will determine if there exists a non-trivial function $\Phi$ which assumes a constant value on the set of all roots of an equation $f(z) = 0$. The form of the equation remains constant regardless of which particular object we are measuring (viewing), but some of the coefficients in this equation may (and generally will) change depending upon the object being viewed, as for example when $f(z) = 0$ expresses a conservation equation. As a result, the set of roots will differ depending upon the object being viewed. Correspondingly, the constant value $\Phi(z)$ will assume a different value depending upon the object being viewed, thus permitting the use of $\Phi$ as a classifier function.

*For extended development of the concepts in this paper contact any of the authors.

In section 2, the mathematics involved are presented, and in section 3 these ideas are applied to the thermophysical invariance problem where the equation $f(z) = 0$ is a conservation statement. Finally, some of the theory is confirmed by experimental data and future directions are discussed.

2  Elements  of  Lie  Group  Analysis 

We explain the theory of Lie Group Analysis as applied to an equation of the form

$f(z) = 0$  (1)

where $z = (z_1, \ldots, z_n) \in \mathbb{R}^n$ and $f$ is a differentiable function, $f \in C^1(\mathbb{R}^n)$. Denote the set of roots of $f$ by

$V(f) = \{ z \in \mathbb{R}^n : f(z) = 0 \}$.  (2)

If the differential $df \neq 0$ $\forall z \in V(f)$ then $f$ implicitly defines a manifold. We assume this manifold to be connected*. Lie group analysis will determine continuous symmetries only; if the manifold is not connected, discrete symmetries may exist and cannot be determined by the methods considered here. An example of a discrete symmetry is reflection. In the physical applications we consider in object recognition problems, discrete symmetries are not an issue. The variables under consideration vary continuously.

*A manifold M is connected if for each pair of points in M there exists a curve in M connecting the two points.

The  concepts  and  theory  given  here  can  be  ex¬ 
tended  to  deal  with  differential  equations  -  and 
this  is  where  Lie  group  analysis  is  used  most  often. 
The  generalization  of  these  techniques  to  differen¬ 
tial  equations  is  not  difficult.  See  Olver  [1993]  for 
such  a  treatment. 

In general, Lie group analysis is applicable to systems of equations; however, any system of equations $g_i = 0$ for $i = 1, \ldots, m$ can be replaced by a single equation $f = \sum_{i=1}^{m} g_i^2 = 0$ in the sense that $V(g_1, \ldots, g_m) = V(f)$. Hence there is no loss of generality in assuming only one equation.

2.1  Curves  and  Groups  of 
Transformations 

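A curve in $\mathbb{R}^n$ is a differentiable function

$I \to \mathbb{R}^n : \varepsilon \mapsto (\alpha_1, \ldots, \alpha_n)$

where $I \subset \mathbb{R}$ is an open interval and $\alpha_i \in C^1(I)$ for $i = 1, \ldots, n$. A curve in $V(f)$ is a curve in $\mathbb{R}^n$ whose image lies in $V(f)$.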

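If $\varphi = (\varphi^1, \ldots, \varphi^n)$ is a vector field on $\mathbb{R}^n$ (so $\varphi^i = \varphi^i(\varepsilon, z)$) then for each fixed $\varepsilon$

$\varphi_\varepsilon = (\varphi^1_\varepsilon, \ldots, \varphi^n_\varepsilon) \in C^1(\mathbb{R}^n) \times \cdots \times C^1(\mathbb{R}^n)$  (n factors),

so each $\varphi_\varepsilon$ determines a transformation map of $\mathbb{R}^n$ given by

$\varphi_\varepsilon : \mathbb{R}^n \to \mathbb{R}^n$.

As $\varepsilon$ varies over $I$ this determines a family of transformations $\{\varphi_\varepsilon\}_{\varepsilon \in I}$.

If we define the evaluation function at $z$ as

$e_z : C^1(\mathbb{R}^n) \times \cdots \times C^1(\mathbb{R}^n) \to \mathbb{R}^n$  (n factors)

then for a fixed $\varepsilon$,

$\varphi_\varepsilon(z) = e_z(\varphi_\varepsilon) = (\varphi^1_\varepsilon(z), \ldots, \varphi^n_\varepsilon(z)) \in \mathbb{R}^n$.

As $\varepsilon$ varies over $I$ this determines a curve by

$\varphi_\bullet(z) = e_z(\varphi_\bullet) : I \to \mathbb{R}^n$, $\varepsilon \mapsto (\varphi^1_\varepsilon(z), \ldots, \varphi^n_\varepsilon(z))$.

In this definition $z$ is treated as a fixed constant.

As $z$ varies over $\mathbb{R}^n$, $\varphi_\bullet(z)$ determines a family of curves, $\{\varphi_\bullet(z)\}_{z \in \mathbb{R}^n}$, one for each point $z \in \mathbb{R}^n$.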

The set of transformations $\{\varphi_\varepsilon(\cdot)\}_{\varepsilon \in I}$ has a natural binary operation defined on it given by composition

$\varphi_\varepsilon \cdot \varphi_\delta : \mathbb{R}^n \to \mathbb{R}^n$, $z \mapsto \varphi_\varepsilon(\varphi_\delta(z))$.

A group of transformations $\{\varphi_\varepsilon(\cdot)\}_{\varepsilon \in I}$ is a set of transformations such that the operation of composition satisfies

i. associativity, $\varphi_\varepsilon \cdot (\varphi_\delta \cdot \varphi_\gamma) = (\varphi_\varepsilon \cdot \varphi_\delta) \cdot \varphi_\gamma$,

ii. there exists an identity element $\varphi_0$, and

iii. each element in $\{\varphi_\varepsilon(\cdot)\}_{\varepsilon \in I}$ has an inverse.

The transformation $\varphi_\varepsilon(\cdot)$ is a parameterized transformation of $\mathbb{R}^n$. Since it has a single parameter, the group of transformations $\{\varphi_\varepsilon(\cdot)\}_{\varepsilon \in I}$ is called a one-parameter group of transformations.

A one-parameter Lie Group is a group which also carries the structure of a 1-dimensional differentiable manifold. This additional structure on a group makes it possible to speak of continuity and differentiability.

2.2  Tangent  Vectors  and  Vector  Fields 

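A tangent vector consists of a vector part and a point of application. We denote a tangent vector by $v_z = (v_1, v_2, \ldots, v_n)_z$ where $(v_1, v_2, \ldots, v_n)$ is “the vector part” and $z$ is the point of application.

If $\varphi_\varepsilon$ is a curve then $\frac{d}{d\varepsilon}\big|_{\varepsilon=a} \varphi_\varepsilon$ determines a tangent vector at $\varphi_a$.

Each tangent vector $v_z$ determines a map by

$v_z : C^1(\mathbb{R}^n) \to \mathbb{R}$, $f \mapsto \frac{d}{d\varepsilon}\big|_{\varepsilon=0} f(z + \varepsilon v_z)$

where $C^1(\mathbb{R}^n)$ is the set of differentiable functions on $\mathbb{R}^n$. For brevity we simply write

$v_z(f) = \frac{d}{d\varepsilon}\big|_{\varepsilon=0} f(z + \varepsilon v_z)$.

It is an easy exercise to show that $v_z(f) = \frac{d}{d\varepsilon}\big|_{\varepsilon=0} f(z + \varepsilon v_z) = \frac{d}{d\varepsilon}\big|_{\varepsilon=0} f(\varphi(\varepsilon))$ for any curve $\varphi$ through the point $z$ satisfying $\frac{d}{d\varepsilon}\big|_{\varepsilon=0} \varphi^i(\varepsilon) = v_i$.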

Lemma 1 Let $v = (v_1, v_2, \ldots, v_n)$ be a vector field and $f \in C^1(\mathbb{R}^n)$. Then

$v(f) = \sum_{i=1}^{n} v_i \frac{\partial f}{\partial z_i}$.

Proof: Apply the chain rule.
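
For completeness, the chain-rule step can be written out as follows. This is a sketch in the notation used above, with $v_z = (v_1, \ldots, v_n)_z$ a tangent vector at $z$ and $\varepsilon$ the curve parameter; it adds no assumptions beyond differentiability of $f$.

```latex
\begin{align*}
v_z(f)
  &= \left.\frac{d}{d\varepsilon}\right|_{\varepsilon=0} f(z + \varepsilon v_z) \\
  &= \sum_{i=1}^{n} \frac{\partial f}{\partial z_i}(z)\,
     \left.\frac{d}{d\varepsilon}\right|_{\varepsilon=0} (z_i + \varepsilon v_i) \\
  &= \sum_{i=1}^{n} v_i \,\frac{\partial f}{\partial z_i}(z).
\end{align*}
```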

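By this lemma it is meaningful to write

$v = v_1 \frac{\partial}{\partial z_1} + v_2 \frac{\partial}{\partial z_2} + \cdots + v_n \frac{\partial}{\partial z_n}$

where $\frac{\partial}{\partial z_1}, \frac{\partial}{\partial z_2}, \ldots, \frac{\partial}{\partial z_n}$ are the partial derivative operators.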

Thus  a  tangent  vector,  and  therefore  vector  fields 
as  well,  can  be  viewed  as  either  an  ordered  n-tuple 
or  as  an  operator.  It  is  this  ability  to  view  tangent 
vectors  (vector  fields)  from  both  perspectives  that 
makes  them  so  powerful. 

2.3  Killing  Fields  and  Infinitesimal 
Generators 


The set of vector fields over $\mathbb{R}^n$ consisting of elements

$v = (v_1, v_2, \ldots, v_n)$

where

$v_i = v_i(z) \in C^1(\mathbb{R}^n)$

form a module over the ring $C^1(\mathbb{R}^n)$ with scalar multiplication being componentwise. Since

$gv : C^1(\mathbb{R}^n) \to C^1(\mathbb{R}^n)$

where

$(gv)f : \mathbb{R}^n \to \mathbb{R}$, $z \mapsto g(z)\, v_z(f)$,

the set of all vector fields satisfying

$v(f) = 0$

form a submodule since

$v(f) = 0$ and $s(f) = 0 \Rightarrow (v + s)f = v(f) + s(f) = 0$

and

$v(f) = 0$ and $g \in C^1(\mathbb{R}^n) \Rightarrow (gv)f = 0$.

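Theorem 2 If $\varphi_\varepsilon(z)$ is a curve in $V(f)$ and $v$ is a vector field satisfying $\frac{d\varphi^i_\varepsilon(z)}{d\varepsilon} = v^i(\varphi_\varepsilon(z))$ for $i = 1, \ldots, n$ then $v(f) = 0$. Conversely, if $\frac{d\varphi^i_\varepsilon(z)}{d\varepsilon} = v^i(\varphi_\varepsilon(z))$ for $i = 1, \ldots, n$, $\varphi_0(z) = z \in V(f)$ and $v(f) = 0$ then $\varphi_\varepsilon(z)$ is a curve in $V(f)$.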

The  process  of  solving  the  equations  to  determine 
a  group  of  transformations  determined  by  the  vec¬ 
tor  field  V  is  called  the  process  of  exponentiation. 

vi(f)  =  i-i 

for  i  =  1, . . . ,  n. 

Corollary  3  Let  v  be  a  vector  field  satisfying 
v(/)  =  0.  Each  infinitesimal  generator  of  v  de¬ 
termines  a  curve  in  V  (/) . 


Corollary  4  Let  {(pejegiR  be  a  group  of  transfor¬ 
mations  ofV{f)  determined  by  the  process  of  ex¬ 
ponentiating.  If  f{^  =  0  then  f{(p^{z))  =  0. 


The  conclusion  of  Corollary  4  is  really  just  a  tau¬ 
tology  since  a  group  of  transformations  of  V{f) 
means  if  f* 6  V{f)  then  Pc{^  G  V{f). 

2.5  The  Group  of  Symmetries,  Sv{f) 

We  have  observed  that  for  an  infinitesimal  gener¬ 
ator  of  a  vector  field  v  satisfying  v(/)  =  0,  the 
solution  to 

dipei^  i/  .-A 

-JT  =  g  Po{^  =  Z 


The  elements  of  this  submodule  are  called  the 
killing  fields  of  /.  (In  more  standard  terminol¬ 
ogy,  these  elements  are  annihilators.  The  descrip¬ 
tor  “killing  fields”  is  more  telling  of  there  role  and 
will  be  employed  here.)  A  collection  of  basis  ele¬ 
ments  for  this  submodule  are  called  infinitesimal 
generators. 

Since  the  infinitesimal  generators  form  a  basis 
for  the  killing  fields  of  /,  every  vector  field  v 
such  that  v(/)  =  0,  with  infinitesimal  generators 
. . . ,  can  be  written  uniquely  as 

n—\ 

V  =  1]  g'g" 

t=i 

for  some  5*  G  (3J")  for  i  =  1, . . . ,  n  -  1 

2.4  Computation  of  Groups  of 
Transformations  from  the 
Infinitesimal  Generators 


determines  a  group  of  transformations.  If  G 
C^(3J")  then  y'v*  is  a  vector  field  such  that 
5'*v®(/)  =  0  and  the  solution  to 

d<Pe{^  i  i,  _ 

=9g  (<^e(^)  Pc{z)  =  2 

determines  a  curve  in  V{f),  and  hence  a  group  of 
transformations  of  F(/) .  More  generally,  since  the 
infinitesimal  generators  {p^,. . .,  p'^}  form  a  basis 
for  the  vector  fields  v  satisfying  v(/)  =  0,  then 
for  any  collection  of  functions 

ff‘GC'i(3?”)  i=l,...,n-l 

it  follows  /n-l  \ 

/  =  0 

\i=l  / 

so  the  solution  to  the  system  of  differential  equa¬ 
tions 


Groups  of  transformations  can  be  calculated  from  determines  a  curve  in  V{f),  and  hence  a  group  of 
the  infinitesimal  generators  by  the  following  transformations  of  V{f). 
The set of all such transformations determined
by this equation is the group of symmetries of
$V(f)$, denoted by $S_{V(f)}$. Clearly it is the smallest
group containing all of the groups “generated” by
the infinitesimal generators $\{\eta^1, \ldots, \eta^{n-1}\}$ as subgroups.
Furthermore any transformation of $V(f)$
can be determined by solving such a system of
equations.

2.6  Invariant  Functions  and  their 
Calculation 

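Suppose we are given the equation $f = 0$. Let

$$\Gamma : S_{V(f)} \times V(f) \to V(f), \qquad (\varphi_\varepsilon, z) \mapsto \varphi_\varepsilon(z)$$

be the $S_{V(f)}$-action on $V(f)$. Then $S_{V(f)}$ acts on
$\hom(V(f), \mathbb{R})$ in a natural way

$$\hat{\Gamma} : S_{V(f)} \times \hom(V(f), \mathbb{R}) \to \hom(V(f), \mathbb{R})$$

where

$$\varphi_\varepsilon * \Phi : z \mapsto \Phi(\varphi_\varepsilon(z)).$$

Definition 5 An element $\Phi \in \hom(V(f), \mathbb{R})$ is
an $S_{V(f)}$-invariant of $\hom(V(f), \mathbb{R})$ if $\Phi$ is invariant
under the action of $S_{V(f)}$ on $\hom(V(f), \mathbb{R})$. In
other words, the stabilizer of $\Phi$ is $S_{V(f)}$:

$$\{\varphi_\varepsilon \in S_{V(f)} : \varphi_\varepsilon * \Phi = \Phi\} = S_{V(f)}$$

It is an elementary exercise in algebra to show

Theorem 6 Let $S_{V(f)}$ be a group acting on a set
$\hom(V(f), \mathbb{R})$. An element $\Phi \in \hom(V(f), \mathbb{R})$
is an absolute $S_{V(f)}$-invariant of $\hom(V(f), \mathbb{R})$ if
and only if

$$\Phi(\varphi_\varepsilon(z)) = \Phi(z) \qquad \forall \varphi_\varepsilon \in S_{V(f)}.$$

Proof. See [Arnold et al., 1997].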

This necessary and sufficient condition is often
taken as the definition of an absolute invariant
function, though the definition of an invariant
element of the set $\hom(V(f), \mathbb{R})$ should be expressed
in terms of the more fundamental action

$$\Gamma : S_{V(f)} \times V(f) \to V(f)$$

The following theorem gives a necessary and sufficient
condition for such an absolute invariant function.

Theorem 7 Let $\eta^i$ for $i = 1, \ldots, n-1$ be the infinitesimal
generators for the killing fields of $f$.
Then $\Phi \in \hom(V(f), \mathbb{R})$ is an absolute $S_{V(f)}$-invariant
function if and only if $\eta^i(\Phi) = 0$ for
$i = 1, \ldots, n-1$.

Proof. See [Arnold et al., 1997].

3  Lie  group  analysis  in  Object 
Recognition 

Several  attempts  at  recognizing  object  material 
types  using  thermophysical  invariance  theory  have 
been  reported  recently.  Lie  group  analysis  has 
been  applied  to  each  of  the  different  models,  in¬ 
cluding  the  true  differential  form  found  in  previ¬ 
ous  papers  [Michel  et  al.,  1997].  The  following 
example  began  with  the  formulation  presented  in 
[Nandhakumar  et  al,  1997],  in  which  the  radiation 
term  was  linearized  and  embedded  into  h.  Further 
modifications  (discussed  below)  simplified  the  Lie 
group  analysis. 

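$$f = W\alpha\cos\theta + h\,(T_\infty - T_s) + k\,\frac{T_s - T_{int}}{y_5} = 0 \qquad (3)$$

(writing $y_5$ for the depth into the material, as in the variable list given with the algebraic form of this equation).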

This  model  does  not  contain  the  energy  storage 
term  present  in  the  previous  models.  Removal 
of  this  term  allows  the  conservation  statement  to 
become  a  conservation  of  heat  flux  statement  as 
opposed  to  the  conservation  of  energy  statement 
used  before.  A  key  reason  for  this  fundamental 
shift  is  to  find  a  model  where  the  terms  are  inde¬ 
pendent. 

A thorough analysis of the invariants of equation
(3) requires the application of Lie group analysis.
Consider the conservation equation (3) modeled
algebraically by

$$y_1 + y_2 y_3 - y_2 a_1 + a_2\,\frac{a_1 - y_4}{y_5} = 0 \qquad (4)$$


where 


$a_1 = T_s$                  Surface temperature
$a_2 = k$                    Thermal conductivity
$y_1 = W\alpha\cos\theta$    Solar absorption
$y_2 = h$                    Heat transfer coefficient
$y_3 = T_\infty$             Ambient temperature
$y_4 = T_{int}$              Internal temperature
$y_5$                        Depth into the material (along the path of conduction)

The $a_i$ variables are measurable (or guessed) in a
recognition scenario and the $y_i$ variables are not.
Ideally we would like to find a function of the $a_i$
variables which is an invariant.

In general, $W$ cannot be measured, while $\alpha\cos\theta$
can be estimated. However, for the experiment
discussed in the next section the entire term,
$W\alpha\cos\theta$, is measured. Also, $a_2$ and $y_5$ are constant,
therefore we will have 4 transformation
groups after using the equations presented in Section 2.


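Generator    Transformation

$v_1$        $y_1 \to y_1 + \varepsilon$
             $a_1 \to a_1 + \frac{y_5}{y_2 y_5 - a_2}\,\varepsilon$

$v_2$        $y_2 \to \left(y_2 - \frac{a_2}{y_5}\right) e^{\varepsilon} + \frac{a_2}{y_5}$
             $a_1 \to (a_1 - y_3)\, e^{-\varepsilon} + y_3$

$v_3$        $y_3 \to y_3 + \varepsilon$
             $a_1 \to a_1 + \frac{y_2 y_5}{y_2 y_5 - a_2}\,\varepsilon$

$v_4$        $y_4 \to y_4 + \varepsilon$
             $a_1 \to a_1 - \frac{a_2}{y_2 y_5 - a_2}\,\varepsilon$

Table 1: Infinitesimal Generators and the corresponding
Groups of Transformations.
Note: the variables not listed under
a Transformation Group undergo
the identity transformation. All these
transformations are global Lie groups.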

The only function invariant under all the transformation
groups is

$$\Phi = g\!\left[\, y_1 + y_2 y_3 - y_2 a_1 + a_2\,\frac{a_1 - y_4}{y_5} \,\right] \qquad (5)$$

where $g$ is an arbitrary function. Hence, analytically,
there are no non-trivial invariant functions
for (3)! It remains to be determined if additional
constraints can be found empirically such that useful
quasi-invariants can be found.
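
A quick numerical check makes the role of the four transformation groups concrete. The sketch below is an illustration under stated assumptions, not the authors' code: it takes the algebraic conservation equation to be $y_1 + y_2 y_3 - y_2 a_1 + a_2 (a_1 - y_4)/y_5 = 0$ and the transformations of Table 1 in the form written in the function bodies (each one changes a single $y_i$ and compensates through $a_1$). The starting values are made-up numbers chosen only so that the equation holds; they are not measured data.

```python
import math

def f(p):
    """Algebraic conservation equation; zero on the solution set V(f)."""
    return (p["y1"] + p["y2"] * p["y3"] - p["y2"] * p["a1"]
            + p["a2"] * (p["a1"] - p["y4"]) / p["y5"])

def v1(p, eps):
    """Shift y1 by eps and compensate through a1."""
    q = dict(p)
    q["y1"] = p["y1"] + eps
    q["a1"] = p["a1"] + p["y5"] * eps / (p["y2"] * p["y5"] - p["a2"])
    return q

def v2(p, eps):
    """Exponential scaling of (y2 - a2/y5) with inverse scaling of (a1 - y3)."""
    q = dict(p)
    c = p["a2"] / p["y5"]
    q["y2"] = (p["y2"] - c) * math.exp(eps) + c
    q["a1"] = (p["a1"] - p["y3"]) * math.exp(-eps) + p["y3"]
    return q

def v3(p, eps):
    """Shift y3 by eps and compensate through a1."""
    q = dict(p)
    q["y3"] = p["y3"] + eps
    q["a1"] = p["a1"] + p["y2"] * p["y5"] * eps / (p["y2"] * p["y5"] - p["a2"])
    return q

def v4(p, eps):
    """Shift y4 by eps and compensate through a1."""
    q = dict(p)
    q["y4"] = p["y4"] + eps
    q["a1"] = p["a1"] - p["a2"] * eps / (p["y2"] * p["y5"] - p["a2"])
    return q

# Illustrative values only; y1 is solved for so that f(start) = 0.
start = {"a1": 300.0, "a2": 1.5, "y2": 10.0, "y3": 290.0, "y4": 295.0, "y5": 0.1}
start["y1"] = 0.0
start["y1"] = -f(start)

# Each transformation should map a point of V(f) to another point of V(f).
residuals = {g.__name__: f(g(start, 0.3)) for g in (v1, v2, v3, v4)}
print(residuals)
```

All four residuals come out at rounding-error level, so each transformation moves a solution of the conservation equation to another solution. This is the property that the validation against thermocouple data relies on.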

4  Experimental  Validation  of  the 
Group  of  Transformations 

To  check  the  groups  of  transformations  found  in 
the  above  application,  experimental  data  from 
a  thermocouple  data  collection  performed  at 
Wright-Patterson  Air  Force  Base  was  used  to  de¬ 
termine  the  transformation  from  one  data  point  A, 
to  another  data  point  B.  The  “ground  truth”  data 
consisted  of  temperature  measurements  acquired 
from  thermocouples  implanted  in  various  types 
of  materials  placed  in  an  outdoor  scene  and  col¬ 
lected  over  a  period  of  2  weeks  in  mid-November. 
The  collection  includes  varying  weather  conditions 
and  has  extensive  records  of  the  atmospheric  pres¬ 
sure,  ambient  temperature,  lighting  conditions, 
etc.  Multiple  temperature  measurements  of  sod, 
clay,  gravel,  concrete,  asphalt,  and  aluminum  were 
recorded  every  15  minutes  and  provide  rough  esti¬ 
mates  of  all  the  variables  in  the  conservation  equa¬ 
tion. 

Currently,  we  measure  and  estimate  all  the  param¬ 
eters  except  h.  Although  we  could  also  estimate 


h,  we  currently  derive  it  from  the  other  estimates 
and  the  conservation  statement.  We  plan  on  esti¬ 
mating  h  in  the  future,  but  for  this  example,  we 
found  it  was  more  useful  to  derive  h  for  two  rea¬ 
sons 

1.  We  can  check  for  reasonable  bounds  on  h  to 
verify  when  our  model  is  working  correctly. 

2.  By  forcing  the  conservation  equation  to  be 
true  at  each  time,  the  transformation  groups 
are  better  illustrated. 

Once  we  have  formed  data  points  for  each  material 
at  various  instances  in  time,  we  can  verify  that  our 
transformation  groups  work  by  solving  for  each  e 
and  applying  it  to  the  surface  temperature  (using 
the  appropriate  transformation).  If  the  transfor¬ 
mations  form  a  group  (as  they  should),  the  con¬ 
servation  equation  will  hold  before  and  after  each 
step.  By  applying  each  of  the  four  transforma¬ 
tions,  we  can  move  between  any  two  points  in  the 
group. 

The missing parts of Figure 1 correspond to times
when  the  physics-based  model  was  determined  to 
break  down.  We  removed  these  points  for  now 
since  the  model  is  not  yet  robust  enough  to  con¬ 
sider  all  the  different  methods  of  heat  transfer.  As 
the  model  is  improved,  we  will  be  able  to  show  re¬ 
sults  for  all  times  and  include  other  factors  such 
as  rain,  shadows,  and  transpiration.  Ideally  an  ex¬ 
tended  time  period  of  data  will  be  used  for  classi¬ 
fication  since  the  material  characteristics  may  be 
masked  at  any  point  by  transient  or  induced  ef¬ 
fects.  Only  after  collecting  an  extended  period  of 
data  could  one  feel  confident  in  a  determination 
of  the  materials  being  viewed. 

As previously discussed, we forced the conservation
statement to hold by solving for $h$ at each
point. If $h$ is estimated, then the resulting conservation
statement will not be exactly zero, say
$f(z) = \delta$. The elements of the group of symmetries
would then satisfy $f(\varphi_\varepsilon(z)) = \delta$. A classifier
would be designed to determine the threshold for
which a point is considered in the class or outside
the class. This is similar to the hypothesize and
verify scheme suggested in previous papers [Nandhakumar
et al., 1997]. However, since we cannot
measure all these parameters, and since we have
shown non-trivial invariants do not exist, we need
to look for new formulations of the model and/or
quasi-invariants.

5  Discussion 

5.1  New  physics-based  models 

Figure 1: (a) Three days of the solar radiation and conduction terms are shown. During the day, solar
heating is clearly a dominant effect. (b) The surface temperature of asphalt before and after
a 2 hour transformation is shown. The second curve is shifted back 2 hours to show the exact
correspondence with the original temperature, thus validating the Lie group analysis.

Another area of research is the model of the conservation
equation. The current model was derived
to characterize “typical” data, with no claim
that it is totally accurate or complete. The model
needs extensive revision and validation in order to
accomplish two major goals:

1.  to  include  all  common  materials  in  any  state 
(day/night,  rain/shine,  etc.) 

2.  to  find  a  model  which  is  both  accurate  and 
for  which  non-trivial  invariants  exist. 

Since  the  current  model  clearly  does  not  fully  char¬ 
acterize  all  of  the  data  all  of  the  time,  this  will  be 
our  next  step.  However,  it  is  likely  that  model 
manipulations  will  not  reveal  the  absolute  invari¬ 
ants  we  desire.  Therefore,  we  must  also  continue 
research  into  ways  of  finding  quasi-invariants. 

5.2  Quasi-invariants 

From section 2.5 it was determined that any curve
in $V(f)$ must satisfy the differential equation

$$\frac{d\varphi_\varepsilon(z)}{d\varepsilon} = v(\varphi_\varepsilon(z)) = \sum_{i=1}^{n-1} g^i \eta^i(\varphi_\varepsilon(z)) \qquad (6)$$

By curve fitting experimental data the vector
fields $v$ can be determined. Since the vector
fields $\eta^i$ for $i = 1, \ldots, n-1$ are known analytically,
the scalar coefficients $g^i \in C^1(\mathbb{R}^n)$ for
$i = 1, \ldots, n-1$ can be determined. If (empirically)
there is an absolute invariant then at least one of
the coefficients $g^i$ would have to be zero. This
would imply they lie in a subspace of the module
determined by the infinitesimal generators. This
could be the result of “overlooking” some physical
constraint that is not accounted for by our single
equation modeling the problem - the conservation
of energy equation. (One known condition we are
ignoring is any bounds on the variables.) Furthermore
the requirement that any curve satisfy
(6) can be used to determine “quasi” (slowly varying)
invariants using elementary functional analysis.
Locally, if for normalized infinitesimal generators,
the condition

$$\|g^i\|_\infty < \delta \qquad (7)$$

for  some  i  is  satisfied  then  a  function  $(•)  can 

be  determined  such  that  ^  ■  These 

types  of  invariants  could  be  just  as  useful  in  prac¬ 
tice  as  an  absolute  invariant. 

6  Summary 

The  techniques  of  Lie  group  analysis  provide  a 
powerful  tool  for  determining  absolute  invariant 
functions  which  can  serve  as  classifier  functions  for 
object  recognition  problems.  We  have  applied  this 
analysis  to  the  thermophysical  invariants  problem 
and  we  have  proven  there  are  no  (nontrivial)  ab¬ 
solute  invariant  functions  for  this  model. 
Abstract

This  paper  compares  several  measures  for 
matching  binary  images  that  are  based  on  dis¬ 
tance  transforms.  In  such  measures,  the  “on” 
pixels  of  one  binary  image  are  used  as  probes 
to  select  distance  transform  values  of  the  other 
image.  We  compare  several  of  these  measures 
using  Monte  Carlo  techniques,  in  order  to  evalu¬ 
ate  their  effectiveness  for  matching  edge  images. 
The  results  demonstrate  that  the  most  effective 
measures  use  some  form  of  outlier  rejection  to 
discard  large  probe  values.  For  high  probability 
of  detection  levels,  the  most  effective  of  these 
techniques  is  a  Hausdorff-based  measure  which 
uses  a  quantile  of  the  probed  distance  values. 

1  Introduction 

This  paper  compares  the  matching  performance 
of  a  number  of  measures  for  comparing  binary 
images  based  on  distance  transforms.  A  dis¬ 
tance  transform  of  a  binary  image  defines  for 
each  image  pixel  the  distance  to  the  nearest 
“on”  pixel  of  that  image,  using  a  given  dis¬ 
tance  function  such  as  Euclidean  distance  (the 
L2  norm).  In  order  to  match  two  images,  the 
“on”  pixels  of  one  image  are  used  as  probes  to 
select  distance  transform  values  of  the  other  im¬ 
age.  Thus  matching  measures  based  on  dis¬ 
tance  transforms  are  asymmetric,  because  one 
image  is  used  to  select  transform  values  in  the 
other. A number of different measures are used
to combine the selected distance transform values
in order to determine the degree of difference
(or resemblance) between two binary images.
The particular measures that we consider
here are defined in the following section. Such
measures have formed the basis of a number
of model-based recognition techniques (e.g., [3,
6]), where they are used to compare binary
attributes extracted from image data. These
methods have been employed in ATR systems,
including for the indexing module being developed
by SAIC under the MSTAR program.

We are interested in characterizing the manner
in which matching measures based on distance
transforms differ from one another, in terms of
their ability to correctly detect a distorted instance
of a target in clutter. In order to determine
the power of different measures, we use
Monte Carlo techniques to estimate Receiver
Operating Characteristic (ROC) curves for each
measure. These curves give the tradeoff between
probability of detection and probability of
a false alarm for the different measures, thus enabling
a determination of which measures perform
better under which operating conditions.
We consider variations in the amount of occlusion
of the target, the amount of background
clutter, the type of background clutter (correlated
“edge chains” and uncorrelated points),
and the spatial perturbation of the target feature
points.
2  Matching  Measures

We examine the following four measures of the
quality of a match between a binary model
and image: (i) chamfer measure, (ii) chamfer
measure with truncated distances, (iii) trimmed
mean of distances, and (iv) generalized Hausdorff
measure. These measures can all be defined
in terms of the distance transform, $\mathrm{dist}_I(p)$, of
the binary image, $I$. This distance transform
can be thought of as a function, parameterized
by $I$, that yields the distance of any point $p$ from
the closest feature point in $I$. In the current
implementation we measure this distance transform
using the $L_1$ norm (city-block distance),
although other distance functions could be used.
Unless otherwise noted, all of the distances in
this paper will be measured with respect to this
norm and the units of all distances will be image
pixels. If we limit ourselves to the points
on a discrete grid (such as image pixels), the
distance transform of an image can be computed
efficiently using a two-pass algorithm [7,
2, 6].
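
For readers unfamiliar with the two-pass idea, the following is a minimal sketch of one common formulation for the city-block ($L_1$) distance: a forward raster scan propagates distances from the top and left neighbours, and a backward scan propagates them from the bottom and right. It is an illustration only, not the implementation used in the experiments; the function name and the nested-list image representation are assumptions.

```python
def l1_distance_transform(image):
    """City-block distance from every pixel to the nearest "on" pixel.

    `image` is a list of rows; truthy entries are "on" pixels.
    """
    rows, cols = len(image), len(image[0])
    far = rows + cols  # exceeds any L1 distance that can occur inside the grid
    dist = [[0 if image[r][c] else far for c in range(cols)] for r in range(rows)]

    # Forward pass: propagate from the top and left neighbours.
    for r in range(rows):
        for c in range(cols):
            if r > 0:
                dist[r][c] = min(dist[r][c], dist[r - 1][c] + 1)
            if c > 0:
                dist[r][c] = min(dist[r][c], dist[r][c - 1] + 1)

    # Backward pass: propagate from the bottom and right neighbours.
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            if r < rows - 1:
                dist[r][c] = min(dist[r][c], dist[r + 1][c] + 1)
            if c < cols - 1:
                dist[r][c] = min(dist[r][c], dist[r][c + 1] + 1)
    return dist

# Small example with two "on" pixels.
example = [[0, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 0, 1]]
transform = l1_distance_transform(example)
for row in transform:
    print(row)
```

Each pixel is visited twice regardless of how many feature points the image contains, which is what makes probing many model positions against one image inexpensive.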

Let $M$ be the set of “on” pixels of a binary
model at some location in the image coordinate
system. That is, in the following definitions
we assume that the pixels in the model have
already been transformed (translated, rotated
and scaled) to the position that we wish to compare
against in the image. In addition, the coordinates
of each pixel are rounded to the closest
integral value.

Chamfer measure: The chamfer measure is
often given as the sum of the distances from
each pixel in the model to their closest image
edge pixels [1]. We instead use the mean of the
distances. This variation yields differences only
when results are compared between different object
models,

$$\mathrm{chamf}(M, I) = \frac{1}{|M|} \sum_{m \in M} \mathrm{dist}_I(m).$$

Chamfer measure with truncated distances:
It is possible to reduce the effect of
occlusion by modifying the chamfer measure to
be robust to outliers, by truncating the distance
transform such that none of the individual
distances is allowed to surpass some maximum
value, $d_{trunc}$,

$$\mathrm{trunc}_d(M, I) = \frac{1}{|M|} \sum_{m \in M} \min(\mathrm{dist}_I(m), d_{trunc}).$$

Trimmed mean of distances: Another
method of making the chamfer measure robust
to occlusion is to trim the largest distances before
summing them. Let $d_i$ be the $i$-th smallest
of the probed distance transform values $D =
\{\mathrm{dist}_I(m) \mid m \in M\}$, that is $(d_1, \ldots, d_{|M|})$ are the
elements of $D$ sorted into nondecreasing order,
and let $f$ be the desired fraction of values that
are summed,

$$\mathrm{trim}_f(M, I) = \frac{1}{\lfloor f|M| \rfloor} \sum_{i=1}^{\lfloor f|M| \rfloor} d_i.$$

Generalized Hausdorff measure: Rather
than summing the distances, the generalized
Hausdorff measure selects the $f$-th quantile value
of the set of distances [3],

$$\mathrm{haus}_f(M, I) = \underset{m \in M}{f^{\mathrm{th}}}\; \mathrm{dist}_I(m).$$

For example, if $f = 0.5$, the generalized Hausdorff
measure selects the median of the distances.
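
The four definitions above can be sketched in a few lines once the distance transform has been probed at the model pixels. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the trimmed mean uses $\lfloor f|M| \rfloor$ values, and the quantile is taken as the $\lceil f|M| \rceil$-th smallest value, which is one of several reasonable conventions.

```python
import math

def chamfer(distances):
    """Mean of the probed distance transform values."""
    return sum(distances) / len(distances)

def truncated_chamfer(distances, d_trunc):
    """Mean after clipping every distance at d_trunc."""
    return sum(min(d, d_trunc) for d in distances) / len(distances)

def trimmed_mean(distances, fraction):
    """Mean of the smallest floor(fraction * |M|) distances."""
    k = max(1, math.floor(fraction * len(distances)))
    return sum(sorted(distances)[:k]) / k

def generalized_hausdorff(distances, fraction):
    """Quantile of the distances: the ceil(fraction * |M|)-th smallest value."""
    k = max(1, math.ceil(fraction * len(distances)))
    return sorted(distances)[k - 1]

# Ten probed values; the two large ones mimic an occluded part of the model.
probed = [0, 1, 0, 2, 40, 0, 1, 50, 0, 1]

scores = {
    "chamfer": chamfer(probed),
    "truncated (d = 2)": truncated_chamfer(probed, 2),
    "trimmed (f = 0.8)": trimmed_mean(probed, 0.8),
    "hausdorff (f = 0.8)": generalized_hausdorff(probed, 0.8),
}
for name, value in scores.items():
    print(name, value)
```

With this sample the plain chamfer score is dominated by the two outliers, while the three robust variants stay close to the small inlier distances.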

3  Estimating  ROC  Curves 

In  order  to  compute  an  ROC  curve  for  a  given 
measure,  we  are  interested  in  estimating  the 
probability  of  detection  and  the  probability  of  a 
false  alarm  over  the  range  of  possible  settings  of 
some  parameter  of  that  measure.  For  the  cham¬ 
fer  measure,  the  only  parameter  is  the  thresh¬ 
old  that  is  used  to  decide  whether  or  not  there 
is  a  match.  The  other  three  measures,  however, 
each  have  an  additional  parameter  besides  this 
threshold: 

•  For the truncated chamfer measure, $\mathrm{trunc}_d$,
the value $d$ at which the distance transform
is truncated will in practice attain only a
small number of values. This is because
as $d$ is increased the measure quickly approximates
the regular chamfer measure.
Empirical results have indicated that when
there is occlusion this measure performs
best when $d$ is quite small. We thus use
the fixed value $d = 2$ for the experiments.

•  For the trimmed chamfer measure, $\mathrm{trim}_f$,
it makes sense to vary both the fraction $f$
and the trimmed mean distance. We thus
generate ROC curves by varying each of the
two parameters independently. In our experiments,
we held the fraction of pixels at
0.8 when the mean distance threshold was
varied and held the mean distance at 0.5
pixels when the fraction of pixels threshold
was varied.

•  For the generalized Hausdorff measure,
$\mathrm{haus}_f$, the distance quantile can in practice
attain only a small number of values.
This is because the fraction $f$ essentially
separates inliers from outliers, and the inliers
all have a small distance. Empirical
results indicate that this measure performs
best when the distance threshold is 1, and
thus we use this fixed value.

We  have  estimated  ROC  curves  for  each  of  the 
measures  described  above  by  performing  match¬ 
ing  in  synthetic  images  and  using  the  matches 
found  in  these  images  to  estimate  the  probabil¬ 
ity  of  detection  and  false  alarm  over  the  range 
of  possible  parameter  settings.  1000  test  im¬ 
ages  were  used  in  the  experiments,  and  were 
generated  according  to  the  following  procedure. 
Random  chains  of  edge  pixels  with  a  uniform 
distribution  of  lengths  between  20  and  60  pix¬ 
els  were  generated  in  a  256  x  256  image  until  a 
predetermined  fraction  of  the  image  was  covered 
with  such  chains.  Curved  chains  were  generated 
by  changing  the  orientation  of  the  chain  at  each 
pixel  by  a  value  selected  from  a  uniform  dis¬ 
tribution  between  ^  and  The  chains  were 
allowed  to  wrap  around  the  edges  of  the  image. 
Once  the  random  background  was  so  generated, 
an  instance  of  a  model  image  was  placed  in  the 
image,  after  rotating,  scaling,  and  translating 
the  model  image  by  random  values.  The  scale 
change  was  limited  to  ±10%  and  the  rotation 


change  was  limited  to  ±^.  Occlusion  was  sim¬ 
ulated  by  erasing  the  pixels  corresponding  to 
a  connected  chain  of  the  model  image  pixels. 
Gaussian noise was added to the locations of
the model image pixels ($\sigma = 0.25$, unless otherwise
noted). The pixel coordinates were finally
rounded to the closest integer.

For  the  experiments  reported  here,  we  per¬ 
formed  recognition  using  the  54  x  54  model  im¬ 
age  shown  in  Figure  1.  An  example  of  a  syn¬ 
thetic  image  generated  using  this  model  image 
and  the  procedure  described  above  is  shown  in 
Figure  2.  In  each  trial,  a  given  matching  mea¬ 
sure  with  a  given  parameter  value  was  used  to 
find  all  the  matches  of  the  model  to  the  image. 
A  trial  was  said  to  find  the  correct  object  if  the 
position  (considering  only  translation  here)  of 
one  of  the  matches  was  within  three  pixels  of 
the  correct  location  of  the  object  in  the  image. 
A  trial  was  said  to  find  a  false  positive  if  any 
match  was  found  outside  of  this  range  (and  that 
match  was  not  contiguous  with  a  correct  match 
position). 
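
The way trial outcomes turn into ROC points can be sketched as follows. This is a simplified illustration, not the procedure used to produce the reported curves: it assumes each trial has been reduced to two hypothetical numbers, the best (lowest) measure value found within three pixels of the true position and the best value found anywhere else, and that a match is declared whenever the measure is at or below the threshold being swept.

```python
def roc_points(target_scores, clutter_scores, thresholds):
    """Return (P_false_alarm, P_detection) for each threshold.

    target_scores[i]  : best measure value near the true position in trial i
    clutter_scores[i] : best measure value away from the true position in trial i
    Lower values mean better matches.
    """
    trials = len(target_scores)
    points = []
    for t in thresholds:
        detections = sum(score <= t for score in target_scores)
        false_alarms = sum(score <= t for score in clutter_scores)
        points.append((false_alarms / trials, detections / trials))
    return points

# Made-up scores for four trials, for illustration only.
target = [0.2, 0.4, 0.9, 0.3]
clutter = [1.0, 0.8, 0.5, 1.2]
curve = roc_points(target, clutter, thresholds=[0.25, 0.5, 1.0])
print(curve)
```

Sweeping the threshold from strict to permissive traces the curve from the lower left (few detections, few false alarms) to the upper right.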

3.1  Summary  of  Results 

Overall the experiments reveal that the chamfer
measure works the least well of the measures,
especially when there is any occlusion of the target
instance. This is because the mean (or sum)
is quite sensitive to a small number of moderately
large distance values, as occurs with partial
occlusion. In the language of robust statistics,
the breakdown point of the measure is zero,
meaning that one arbitrarily large value can
make the entire measure arbitrarily large.
This is a bad property when there are outliers,
as occurs with partial occlusion. The other measures
are all robust in the sense that they have
a nonzero breakdown point, thereby enabling
some number of arbitrarily large distance values
to be ignored. The second overall result
is that the generalized Hausdorff measure performs
better than the other measures when $P_d$
(the probability of detection) is high. Under
most conditions the operating range of interest
is where $P_d$ is high, making the generalized
Hausdorff measure the most appropriate one.
There is one major exception to this case, which
is for dense, uniformly random, point noise. In
this case the generalized Hausdorff measure performs
the worst of all the measures. However,
uniform random noise does not generally occur
in practice. In real images, noise is generally
correlated.

Figure 1: The model image used in the experiments.

Figure 2: An example of a synthetic image
that was generated with curved
chains of pixel clutter. This example
contains 5% clutter.

Figure 3: ROC curves as a function of occlusion
(fixed clutter 5%): (a) 10% occlusion,
(b) 15% occlusion, (c) 20%
occlusion, (d) 25% occlusion.

3.2  Results 

Figure 3 shows ROC curves for each of the
different measures, illustrating how the performance
of the measures changes with different
amounts of occlusion of the object model. For
each of these curves, 5% image clutter was used
and the occlusion was varied from 10% through
25%. Levels of occlusion below 10% are not
shown for this level of clutter, since all of the
measures perform well for this case. For 10%
occlusion, the generalized Hausdorff measure
performs the best, with the chamfer measure
and the trimmed mean measure (with a fixed
fraction of 0.8) performing the worst. As the
amount of occlusion is increased, the most noticeable
difference is the very poor performance
of the chamfer measure, which is the only one
of the measures that does not have some means
of dealing with occlusion. Another notable difference
is that for at least 20% occlusion, the
generalized Hausdorff measure performs worse
than any measure except for the chamfer measure
when the probability of detection is
low, but it surpasses all of the measures when
the probability of detection is high. Since recognition
methods are usually concerned with the
case where the probability of detection is high,
these experiments indicate that the generalized
Hausdorff measure will typically have the best
performance.

Figure 4: ROC curves as a function of clutter
(fixed occlusion 20%): (a) 2.5% clutter,
(b) 5% clutter, (c) 10% clutter,
(d) 15% clutter.

ROC curves illustrating how the performance
of the measures varies with differing levels of
clutter yield similar results (not shown here).
In these experiments, 20% occlusion was used,
while the level of clutter was varied from 2.5% to
10%. The generalized Hausdorff measure performs
best when the probability of detection is
high and the chamfer measure with truncated
distances performs best when the probability of
detection is not high. Once again, the chamfer
measure performs quite poorly when truncated
distances are not used.

An additional experiment tested how the performance
of the measures changed when the noise
added to the object model pixels was increased.
Figure 5(b) shows curves generated for the case
with 5% clutter and 20% occlusion, but with
increased Gaussian noise ($\sigma = 0.5$) in the localization
of the object model pixels. The Hausdorff
measure and the chamfer measure (which
is already quite poor) are affected the least by
this increase in the noise level, as can be seen by
comparison with Figure 5(a), which shows the
$\sigma = 0.25$ perturbation that was used in the previous
experiments.
Abstract

A purely geometric definition of Gaussian curvature
is used for the extraction of the sign of Gaussian
curvature from photometric data. Consider a
point $p$ on a smooth surface $S$ and a closed curve $\gamma$
on $S$ which encloses $p$. The image of $\gamma$ on the unit
normal Gaussian sphere is a new curve $\beta$. The sign
of Gaussian curvature at $p$ is determined by the relative
orientations of the closed curves $\gamma$ and $\beta$. The
relative orientation of two such curves is directly
computed from intensity data. We employ three
unknown illumination conditions to create a photometric
scatter plot. This plot is in one-to-one correspondence
with the subset of the unit Gaussian
sphere containing the mutually illuminated surface
normals. This permits direct computation of the
sign of Gaussian curvature without the recovery of
surface normals. Our method is albedo invariant.
We assume diffuse reflectance, but the nature of
diffuse reflectance can be general and unknown.
Empirical results demonstrate the performance of
our technique.

1.  Introduction 

Surface curvature provides a unique three-dimensional,
viewpoint-invariant description of local surface shape. Thus,
curvature is a useful tool for scene analysis, feature extraction and
object recognition (particularly if the scene contains sculpted,
warped, free-form surfaces). Estimates of local surface types, based
on the signs of mean and Gaussian curvature, have been widely used for
image segmentation and classification algorithms [1, 4, 5, 7, 14].

Extensive work has been done in the recovery of Gaussian curvature
from range images. One technique involves fitting a local surface
[1, 11, 12] on the range data in order to determine the partial
derivatives necessary for the evaluation of Gaussian curvature.
Another methodology recovers Gaussian and mean curvature from a
collection of directional curvature estimates [4, 7]. However,
experimental results have found that the resulting curvature estimates
are very sensitive to noise [5, 6, 9].

* This research was supported in part by ARPA grant
DAAH04-94-G-0278, AFOSR grant F49620-93-1-0484 and
the NSF National Young Investigator Award IRI-9357757.

Gaussian  curvature  can  be  recovered  from  intensity 
images.  Blake  and  Cipolla  [2]  extract  curvature 
along  apparent  contours  for  arbitrary  curvilinear 
viewer  motion.  Woodham  [16]  uses  photometric 
stereo  techniques  to  recover  local  surface  orienta¬ 
tion.  He  can  then  determine  local  curvature  by  tak¬ 
ing  the  partial  derivatives  of  the  image  irradiance 
equation.  Wolff  and  Fan  [3,  14]  recover  the  sign  of 
the  Gaussian  curvature  without  recovering  the  nor¬ 
mal  map.  However,  they  assume  Lambertian  reflec¬ 
tance  and  they  require  some  illumination  planning. 

Our  method  recovers  the  sign  of  Gaussian  curva¬ 
ture  directly  from  intensity  data  resulting  from  dif¬ 
fuse  reflection.  One  of  the  key  ideas  is  not  to 
perform  any  surface  fitting  or  recover  the  surface 
normals.  No  derivatives,  or  local  matrices  and  their 
determinants  are  computed.  We  use  the  construc¬ 
tive  geometric  definition  of  Gaussian  curvature. 
The  Gauss  map  preserves  orientation  of  closed 
curves  around  elliptical  points,  while  it  reverses 
orientation  for  hyperbolic  points.  Points  of  zero 
Gaussian  curvature  generate  a  closed  curve  which 
encloses  a  zero  area. 
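
The orientation property described above can be checked numerically on simple analytic surfaces. The following is a minimal sketch, not the method of this paper: it assumes a surface given as a graph z = f(x, y) with a known gradient, walks a small counterclockwise loop around the origin, maps it through the Gauss map, and compares signed areas using the shoelace formula. The function names, the loop radius and the tolerance are all illustrative assumptions.

```python
import math

def unit_normal_xy(fx, fy):
    """x and y components of the upward unit normal of the graph z = f(x, y)."""
    norm = math.sqrt(fx * fx + fy * fy + 1.0)
    return (-fx / norm, -fy / norm)

def signed_area(points):
    """Shoelace formula: positive for a counterclockwise polygon."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return 0.5 * total

def gauss_map_sign(gradient, radius=0.05, samples=64, tol=1e-9):
    """Return +1, -1 or 0 for preserved, reversed or degenerate orientation."""
    loop = [(radius * math.cos(2 * math.pi * k / samples),
             radius * math.sin(2 * math.pi * k / samples))
            for k in range(samples)]
    # Image of the loop on the Gaussian sphere, seen from above the north pole.
    image = [unit_normal_xy(*gradient(x, y)) for x, y in loop]
    ratio = signed_area(image) / signed_area(loop)
    if abs(ratio) < tol:
        return 0
    return 1 if ratio > 0 else -1

# Gradients (f_x, f_y) of three hypothetical test surfaces.
bowl = lambda x, y: (x, y)         # z = (x^2 + y^2) / 2, elliptic point
saddle = lambda x, y: (x, -y)      # z = (x^2 - y^2) / 2, hyperbolic point
cylinder = lambda x, y: (x, 0.0)   # z = x^2 / 2, zero Gaussian curvature
```

The returned value is intended to match the sign of the Gaussian curvature at the origin for each of the three test surfaces.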

Instead  of  examining  the  behavior  of  such  curves 
on  the  Gaussian  sphere,  we  can  study  the  corre¬ 
sponding  curves  formed  in  photometric  space. 
Although  diffuse  reflectance  is  assumed,  the  nature 
of  the  diffuse  reflectance  can  be  quite  general  and 
need  not  be  known.  The  illumination  conditions 
are  also  completely  unknown.  Each  triplet  of  inten¬ 
sity  values  is  a  point  in  photometric  space.  The  col¬ 
lection  of  intensity  triplets  for  all  the  mutually 
illuminated  points  composes  a  photometric  scatter 
plot. This scatter plot is in one-to-one correspondence
with the subset of the Gaussian sphere containing
the surface normals of the mutually
illuminated  points.  Thus,  the  curve  orientation  test 
for  the  sign  of  Gaussian  curvature  can  be  per¬ 
formed  directly  in  photometric  space.  This  elimi¬ 
nates  the  need  for  recovering  the  surface  normals. 
Although  this  technique  relies  on  the  observed 
brightness  values,  it  is  albedo  invariant. 


2. Sign of Gaussian Curvature

Given an orientable surface $S$, the Gauss map $N: S \to S^2$ can be
thought of as taking the unit surface normal at each point $p \in S$
and "translating" it to the origin. Given a surface $S$, the Gauss map
provides a framework for studying the surface normals on $S$ and their
directional changes.

The surface curvature at a point $p \in S$ can be measured by
examining the behavior of the surface normals in a local neighborhood
of $p$ using the Gauss map. Consider a small simple closed curve
$\gamma$ on a surface $S$ around a point $p \in S$. The curve $\gamma$
contracts to approach the point $p$. Each point $q \in \gamma$ has a
surface normal that is mapped on the unit sphere $S^2$. Consider now a
particle on the surface $S$ which is moving along the curve $\gamma$
in a counterclockwise fashion. As this particle moves on $S$ from one
point of the curve $\gamma$ to the next, it traverses a curve $\beta$
on the Gaussian sphere (i.e. the curve $\beta$ is composed of the
endpoints of the unit surface normals of the points on $\gamma$). See
fig. 1.


Figure 1 : (a) Preservation and (b) reversal of
curve orientation on the Gauss map


Curve $\beta$ can have only two possible orientations, corresponding
to the two possible directions of motion along the curve. If the
particle, which moves in a counterclockwise fashion on $\gamma$,
traces the curve $\beta$ also in a counterclockwise manner, then the
Gauss map is orientation preserving at point $p$. If the particle
traces the curve $\beta$ in a clockwise manner, then the Gauss map is
orientation reversing at point $p$. The Gaussian curvature at a point
$p \in S$ is positive, $K(p) > 0$, if the Gauss map is orientation
preserving at $p$, and negative, $K(p) < 0$, if the Gauss map is
orientation reversing at $p$. For $K(p) = 0$, the area enclosed by the
curve $\beta$ is equal to zero.


3.  Photometric  Space 

An  image  is  a  two-dimensional  intensity  pattern.  A 
reflectance  map  R  is  a  means  of  specifying  the 
dependence  of  intensity  values  I  on  surface  orienta¬ 
tion  [8]: 

$$I = R(\mathbf{n}) \qquad (1)$$

This  map  combines  information  about  surface 
material,  scene  illumination  and  viewing  geometry 
into  a  single  representation  which  determines  the 
image  brightness  as  a  function  of  surface  orienta¬ 
tion. 

Woodham [15, 16] observed in his techniques for photometric stereo
that one can determine the surface normal as a function of a triplet
of measured intensity values $(I_1, I_2, I_3)$. He showed that by
using a calibration object of known shape one can build a lookup table
which maps measured intensity triplets to the corresponding surface
normals. He explicitly recovered the mapping between the surface
normals and the specific reflectance map. The recovered mapping was
not albedo independent.

We  never  recover  the  map  between  the  intensity 
triplets  and  the  surface  normals.  We  exploit  only 
the  existence  of  a  one-to-one  correspondence 
between  triplets  of  photometric  values  and  surface 
normals  of  mutually  illuminated  points.  The  spe¬ 
cifics  of  the  image  formation  process  need  not  be 
known  because  the  surface  normals  are  not  recov¬ 
ered.  The  input  to  the  curve-orientation  process  is 
three  successive  images  of  the  same  scene  under 
completely  unknown  illumination  conditions.  The 
only  requirement  is  that  the  direction  vectors  of  the 
three  light  sources  be  non-coplanar. 

For each mutually illuminated pixel there is a triplet of intensity
values $(I_1, I_2, I_3)$. Each intensity triplet is a point in a three
dimensional photometric space $\Phi^3 \subset \mathbb{R}^3$, where
each axis represents intensity from one of the three illumination
conditions. When assigning a specific axis to a unique illumination
condition one should preserve the relative ordering of the light
sources. The preservation of the relative ordering of the light
sources guarantees that there is no basis reversal between the axes of
the world coordinate system and the axes of the photometric coordinate
system.

3.1 Scatter Plot

Collectively, the intensity triplets of all the mutually illuminated
pixels generate a scatter plot in $\Phi^3$. Woodham [15] showed that
for constant albedo Lambertian surfaces the resulting scatter plot is
a 6-degree-of-freedom ellipsoid. In general, diffuse surfaces of
constant albedo exhibit gradual shading transitions; a small change in
the direction of the surface normal results in a relatively small
change in the observed intensity value. Consequently, the scatter
plots of diffuse constant albedo objects are surfaces of positive
Gaussian curvature.
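
As an illustration of how such a scatter plot could be assembled, the sketch below collects one intensity triplet per mutually illuminated pixel from three registered images. It is a minimal sketch under assumptions that are not in the paper: images are plain nested lists of floats, and a pixel counts as "mutually illuminated" when its intensity exceeds a hypothetical `threshold` in all three images.

```python
def photometric_scatter(img1, img2, img3, threshold=0.0):
    """Map (row, col) -> (I1, I2, I3) for pixels lit in all three images.

    The image order must follow the ordering of the light sources so that
    the axes of photometric space are assigned consistently.
    """
    triplets = {}
    for r, rows in enumerate(zip(img1, img2, img3)):
        for c, triplet in enumerate(zip(*rows)):
            # Keep the pixel only if every light source illuminates it.
            if min(triplet) > threshold:
                triplets[(r, c)] = triplet
    return triplets

# Tiny hypothetical 2 x 2 images; zeros stand for shadowed pixels.
image_a = [[0.2, 0.0], [0.5, 0.9]]
image_b = [[0.3, 0.4], [0.0, 0.8]]
image_c = [[0.1, 0.6], [0.7, 0.7]]
scatter = photometric_scatter(image_a, image_b, image_c)
```

Here only the top-left and bottom-right pixels are lit in all three images, so only those two triplets enter the scatter plot.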


Figure 2 : Photometric scatter plots, (a) and (b), for diffuse
reflectance spheres of constant albedo.

For  example,  consider  the  generalizations  of  the 
Lambertian  model  for  rough  [10]  and  smooth  [13] 
surfaces.  Fig.  2  shows  the  scatter  plots  generated 
by  simulating  various  diffuse  reflectance  models. 
Fig.  2(a)  is  created  by  a  sphere  with  a  sand-paper 
texture.  Fig.  2(b)  is  created  by  a  smooth  diffuse 
sphere.  Both  plots  have  positive  Gaussian  curva¬ 
ture  everywhere.  However,  when  a  surface  exhibits 
specular  reflectance,  the  shading  transitions  are  no 
longer  gradual.  Due  to  specularities,  small  changes 
in  the  direction  of  the  surface  normal  might  cause 
abrupt  changes  in  the  observed  brightness  values. 
Thus,  the  scatter  plots  of  specular  reflectance 
objects  are  not  necessarily  surfaces  of  uniform  sign 
of  Gaussian  curvature. 

Let $\varphi: S^2 \to \Phi^3$ be the photometric map that transforms
points on the unit Gaussian sphere $S^2$ to points in the photometric
scatter plot $\Phi^3$. Typically, there is a one-to-one correspondence
between the intensity triplets $(I_1, I_2, I_3)$ and the surface
normals $\mathbf{n}$ of the mutually illuminated surface points. For
constant albedo diffuse surfaces, the photometric map $\varphi$ is
differentiable for all points $q \in S^2$. Furthermore, for such
surfaces, both the unit Gaussian sphere $S^2$ and the photometric
scatter plot $\Phi^3$ are surfaces of uniformly positive Gaussian
curvature. This uniformity of the sign of Gaussian curvature in both
$S^2$ and $\Phi^3$ makes the photometric map $\varphi$ uniformly
either preserve or reverse orientation for all points on $S^2$. The
intrinsic geometry of the photometric scatter plot need not be known.
The photometric map $\varphi$ will be uniformly orientation reversing
if and only if there is a basis reversal between the coordinate
systems of the Gaussian sphere $S^2$ and the photometric space
$\Phi^3$. However, by construction of the photometric space, there is
no such basis reversal.

The composition of the Gaussian map $N$ and the photometric map
$\varphi$ creates a new map $\varphi \circ N: S \to \Phi^3$ which
directly transforms mutually illuminated points on the surface $S$ to
points in the photometric space (see fig. 3).


Figure 3 : A surface $S$ and its maps on the Gaussian sphere $S^2$
and on the photometric space $\Phi^3$. (Panel label: a constant albedo
photometric scatter plot.)

Formally, by the chain rule for maps, since both the Gauss map
$N: S \to S^2$ and the photometric map $\varphi: S^2 \to \Phi^3$ are
differentiable, the composite map $\varphi \circ N: S \to \Phi^3$ is
also differentiable:

$$d(\varphi \circ N)_p = d\varphi_{N(p)} \circ dN_p \qquad (2)$$

For the composite map $\varphi \circ N$, the Jacobian determinant is:

$$|d(\varphi \circ N)_p| = |d\varphi_{N(p)}| \, |dN_p| \qquad (3)$$

However, the photometric map $\varphi$ is orientation preserving for
diffuse reflectance surfaces, that is, $|d\varphi_{N(p)}| > 0$. Thus:

$$\mathrm{sign}(|d(\varphi \circ N)_p|) = \mathrm{sign}(|dN_p|) = \mathrm{sign}(K) \qquad (4)$$

This means that there is no need to recover the surface normals. We
can perform the curve-orientation test on the map $\varphi \circ N$
directly.

3.2  Multiple  Albedo  Surfaces 

So far we have ignored the effect of albedo $\rho$. When the same
surface normal occurs at surface points which have distinct albedos,
one gets multiple triplets corresponding to the same surface normal.
There is then no one-to-one correspondence between the intensity
triplets $(I_1, I_2, I_3)$ and the surface normals $\mathbf{n}$ of the
mutually illuminated surface areas. However, gnomonic projection of
the scatter plot on a single plane cancels out the effects of albedo
and enforces a one-to-one mapping between the Gaussian sphere $S^2$
and the photometric space $\Phi^3$.
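
The albedo cancellation can be seen in a small numerical sketch. Several things here are assumptions made only for illustration: the projection plane is taken to be $I_3 = 1$ (the text only says "a single plane"), the reflectance is Lambertian (the method itself allows more general diffuse reflectance), and the light vectors and function names are hypothetical.

```python
import math

# Hypothetical, non-coplanar light vectors, one per illumination condition.
LIGHTS = ((1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (-1.0, -1.0, 1.0))

def lambertian_triplet(normal, albedo, lights=LIGHTS):
    """Intensity triplet of a Lambertian point: I_k = albedo * (l_k . n)."""
    return tuple(albedo * sum(l * n for l, n in zip(light, normal))
                 for light in lights)

def gnomonic(triplet):
    """Central projection from the origin onto the plane I3 = 1."""
    i1, i2, i3 = triplet
    return (i1 / i3, i2 / i3)

normal = (0.0, 0.6, 0.8)  # a unit surface normal lit by all three sources
dark = gnomonic(lambertian_triplet(normal, albedo=0.2))
bright = gnomonic(lambertian_triplet(normal, albedo=0.9))
```

Because albedo scales all three intensities by the same factor, it moves a triplet along a ray through the origin of photometric space, and the central projection maps every point of that ray to the same location on the plane.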

4.  Curve-Orientation  Test 

The general idea of the curve-orientation algorithm is to construct a
closed curve around each mutually illuminated pixel and test in
photometric space whether its orientation is preserved or reversed,
and whether it encloses an area equal to zero. We assume a
right-handed image coordinate system with the z-axis pointing towards
the viewer. We first create a closed curve $\gamma$ around each
mutually illuminated pixel $p$. The curve $\gamma$ is always traversed
in a clockwise manner.

We then examine the corresponding curve $\beta'$ that is created on
the gnomonic projection plane in photometric space. The clockwise
traversal of $\gamma$ enforces an order in the sequence by which the
list of pixels in $\gamma$ is visited. We maintain this order even
after the points are mapped in photometric space. Assume, for example,
that pixel $p_1$ precedes $p_2$ in $\gamma$. Then
$q_1 = (\varphi \circ N)(p_1)$ will be visited before
$q_2 = (\varphi \circ N)(p_2)$ in $\beta'$, independent of
preservation of orientation. Assume that in the closed curve $\gamma$
the segment $(p_1, p_2)$ precedes $(p_2, p_3)$. Let $v_\gamma$ be the
cross-product of these segments:

$$v_\gamma = (p_1, p_2) \times (p_2, p_3) \qquad (5)$$

where the angle between the two segments which is facing the area
enclosed by the curve is less than or equal to $\pi$. The vector
$v_\gamma$ always points in the same direction as the local surface
normal.


Figure  4  :  (a)  Preservation  and  (b)  reversal  of 
orientation 


Accordingly, due to the enforced order, the segment $(q_1, q_2)$
precedes the segment $(q_2, q_3)$ on the closed curve $\beta'$. Let
$v_{\beta'}$ be the cross-product of these segments:

$$v_{\beta'} = (q_1, q_2) \times (q_2, q_3) \qquad (6)$$

where the angle between the two segments which is facing the area
enclosed by the curve is less than or equal to $\pi$. The vector
$v_{\beta'}$ is always parallel to the normal of the projection plane.
The map $\varphi \circ N$ preserves orientation if $v_{\beta'}$ points
away from the origin of the photometric space and reverses orientation
if $v_{\beta'}$ points towards that origin (see fig. 4).

By definition, in a right-handed triplet of vectors the angle between
the two vectors which are the arguments of the cross-product operation
should be less than or equal to $\pi$. However, the closed curve
$\beta'$ has an arbitrary shape. The angle from a preceding
line-segment to a succeeding one, which is facing the enclosed area,
is not always less than or equal to $\pi$. In such a case the
right-handed cross-product test cannot be applied. By taking the
convex hull of the points that compose the curve $\beta'$ we are
guaranteed that the angle between all the adjacent segments is less
than $\pi$. The ordering between the segments is maintained in the
resulting convex hull.
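
The test can be exercised end to end on simulated data. The sketch below is not the implementation used in this paper: it swaps the convex hull and per-vertex cross-product test for a signed-area (shoelace) comparison, which serves the same purpose of deciding whether the projected curve keeps or reverses the traversal direction. It also assumes Lambertian shading, a projection plane $I_3 = 1$, hypothetical light vectors ordered so that their determinant is positive, and test surfaces given as graphs z = f(x, y).

```python
import math

# Hypothetical light vectors (rows); their determinant is 3 > 0, so the
# ordering of the sources introduces no basis reversal.
LIGHTS = ((1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (-1.0, -1.0, 1.0))

def projected_triplet(gradient, albedo, x, y):
    """Shade the point (x, y) under each light, then project onto I3 = 1."""
    fx, fy = gradient(x, y)
    norm = math.sqrt(fx * fx + fy * fy + 1.0)
    n = (-fx / norm, -fy / norm, 1.0 / norm)
    i1, i2, i3 = (albedo(x, y) * sum(l * c for l, c in zip(light, n))
                  for light in LIGHTS)
    return (i1 / i3, i2 / i3)

def signed_area(points):
    """Shoelace formula: positive for a counterclockwise polygon."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return 0.5 * total

def curvature_sign(gradient, albedo, radius=0.05, samples=32, tol=1e-6):
    """+1, -1 or 0: orientation of the loop image relative to the loop."""
    loop = [(radius * math.cos(2 * math.pi * k / samples),
             radius * math.sin(2 * math.pi * k / samples))
            for k in range(samples)]
    image = [projected_triplet(gradient, albedo, x, y) for x, y in loop]
    # Taking a ratio makes the result independent of the traversal direction.
    ratio = signed_area(image) / signed_area(loop)
    if abs(ratio) < tol:
        return 0
    return 1 if ratio > 0 else -1

uniform = lambda x, y: 1.0
# Two-valued checkerboard albedo, to exercise the albedo invariance.
checker = lambda x, y: 0.3 + 0.6 * ((math.floor(40 * x) + math.floor(40 * y)) % 2)

bowl = lambda x, y: (x, y)         # elliptic at the origin
saddle = lambda x, y: (x, -y)      # hyperbolic at the origin
cylinder = lambda x, y: (x, 0.0)   # zero Gaussian curvature
```

The surface normals are computed here only to synthesize intensities; the orientation decision itself uses nothing but the projected intensity triplets.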

We perform this cross-product test at every point that composes the
curve $\beta'$. We also examine the eccentricity of the curve
$\beta'$. If the eccentricity is high, then we conclude that the curve
encloses an area that approaches zero. After the sign of the Gaussian
curvature is computed at every mutually illuminated pixel, a dominance
averaging of the signs is performed. If there is a uniform
distribution of positive, negative and zero curvature pixels in the
averaging window, then it is concluded that this is a zero-curvature
area. Otherwise the central pixel is assigned the sign that dominates
the window.
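
One possible reading of the dominance averaging step is sketched below. The text does not say how "uniform distribution" is decided, so the rule used here (an exact tie between the two most frequent signs counts as "no dominant sign") is an assumption, as are the function name and the default window size.

```python
from collections import Counter

def dominance_average(signs, window=3):
    """Replace each sign (+1, -1, 0) by the dominant sign in its window."""
    rows, cols = len(signs), len(signs[0])
    half = window // 2
    result = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the signs inside the window, clipped at the image border.
            counts = Counter(
                signs[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1)))
            ranked = counts.most_common(2)
            if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
                result[r][c] = 0              # assumed rule: a tie means zero
            else:
                result[r][c] = ranked[0][0]   # the sign that dominates
    return result

# An isolated misclassified pixel is absorbed by its neighbors.
noisy = [[1, 1, 1],
         [1, -1, 1],
         [1, 1, 1]]
smoothed = dominance_average(noisy)
```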

5.  Experimental  Results 

Our experiments indicated that our method is insensitive to the size
of the window used for generating the curve on which the orientation
test was performed. Various window sizes were tried: 5 x 5, 7 x 7, and
9 x 9. All produced comparable results for both synthetic and real
data. After the sign of the Gaussian curvature was computed at each
mutually illuminated pixel, the pixels were colored according to the
derived sign. Elliptic points (i.e. points of positive Gaussian
curvature) are shown in medium grey, hyperbolic points (i.e. points of
negative Gaussian curvature) in dark grey, and parabolic and planar
points (i.e. points of zero Gaussian curvature) in light grey.

The experimental setup was not elaborate. The light sources were not
precision mounted. They could be placed anywhere as long as their
direction vectors were not coplanar. A Sony XC-77 camera was used with
a 25mm lens. Cross-polarization was used in scenes that involved
objects which generated specularities. The objects were placed about
60cm from the light sources. The curve-orientation method was tested
on a variety of items made out of different materials which had either
constant or multiple albedo.

One of the first objects we tested was a torus whose color had been
chipped away (see fig. 5). The torus is a good test object because it
has well-defined regions of positive and negative Gaussian curvature.


Figure 5 : An unevenly painted torus and its sign
of Gaussian curvature segmentation

Among the objects we tried was a small vase with a pattern of colorful
ducks and flowers. As one can see in fig. 6, the recovery of the sign
of Gaussian curvature is not significantly affected by the color
patterns on the surface. An uneliminated specularity at the center of
the neck of the vase causes an erroneous zero-curvature
classification.


Figure  7  :  A  dual  albedo  cup  and  its  sign  of 
Gaussian  curvature  segmentation 

In order to test our methodology on a surface which has multiple
albedo and is not primarily a surface of zero Gaussian curvature, we
constructed a free-form object out of modeling clay. We used yellow
and magenta clay and created a handbag-like object. The yellow handle
is a rough approximation to a segment of a torus. The main body has a
concave divot in the center. Its left half is relatively flat, while
its right half is more uneven. Fig. 8 shows the sign of Gaussian
curvature segmentation of our clay object.


Figure  8  :  A  clay  object  and  its  sign  of  Gaussian 
curvature  segmentation 


Figure  6  :  A  colorful  vase  and  its  sign  of  Gaussian 
curvature  segmentation 


Finally, we tested the tolerance of our algorithm on very rough
surfaces. Fig. 9 shows a small wooden carving of a bear and its
segmentation. The flank is flat, but there are surface variations due
to the wood grain and the carver's cuts. These are deep relative to
the size of the object and are actually identified as hyperbolic
areas.


Fig. 7 demonstrates the performance of our technique on a dual albedo
surface which has mostly zero Gaussian curvature.

Figure 9 : A wooden carving and its sign of
Gaussian curvature segmentation


6.  Conclusions  and  Future  Research 

We have presented a direct method for recovering the sign of Gaussian
curvature using a curve-orientation invariant. Due to its geometric
nature, our technique is invariant to basis-preserving affine
transformations, as well as albedo. All the computations are performed
in photometric space. However, no assumptions about the model of
diffuse reflectance are made. The experimental setup is inexpensive
and easy to duplicate. The position of the light sources is unknown.

We are currently examining the possibility of performing our
curve-orientation test directly on the photometric scatter plot. We
would like to eliminate the projection to a plane, while maintaining
the albedo invariance. We would also like to generalize our technique
to include surfaces that exhibit specular reflectance. Finally, we are
exploring the possibility of using similar direct computation
techniques for the recovery of the magnitude of Gaussian curvature and
the sign and magnitude of mean curvature.

Abstract

Point-surface distance calculations and nearest surface point queries
are essential to a broad range of geometrical applications.
Registration, intersection detection in simulations and navigation all
require fast solutions for these problems. It is demonstrated that
k-dimensional (k-D) trees, one of the simplest and most popular
structures for point location, exhibit substantial performance
decreases when used on surfaces embedded in $R^3$. An augmented k-D
tree, the face k-D tree, is proposed that greatly reduces real query
cost for surfaces at the expense of the introduction of a bounded
amount of error in query results. A side-effect of this augmentation
is also presented: the fast generation of reduced resolution surface
representations suitable for visualization.

1.  Introduction 

Point location, the determination of the closest point or points in a
given set of points to an arbitrary query location, is one of the
fundamental problems of Computational Geometry and an essential tool
for many tasks in Computer Vision. This problem is particularly
important in the registration/pose-refinement domain because
algorithms like Iterated Closest Point (ICP) [1, 6, 8] rely on large
numbers of nearest-neighbor queries to establish and progressively
refine inter-object correspondences. Any reduction in individual query
cost results in a substantial performance gain for ICP. Gains can also
be realized for applications that require intersection detection of
objects in $R^3$, such as battlefield simulation and virtual
environments.

The Voronoi diagram [4] is a geometric data structure known to have a
provably optimal nearest-neighbor query time of $O(\log N)$. In $R^2$
it is always possible to construct the Voronoi diagram for a set of
$N$ points in $O(N \log N)$ time with deterministic algorithms.
However, in $R^3$, the Voronoi diagram may have size $O(N^2)$, forcing
worst case $O(N^2)$ preprocessing time. This undesirable worst-case
behavior in $R^3$ makes Voronoi diagrams unsuitable for applications
handling large point sets. For this reason and other implementation
concerns, k-D trees are more suitable for the point location task in
$R^3$.

The k-D tree [2] has become a popular method of performing
nearest-neighbor, k-nearest-neighbor and point-surface distance
queries in a range of computer vision applications. The k-D tree is
generated by recursive binary partitioning of space. At each interior
node of the tree, space is partitioned by a hyperplane. The member
points of the object set are found at the leaf nodes of this binary
tree. This paper will deal with the specific case of $R^3$ and
henceforth all examples can be assumed to be in $R^3$.

This paper will show that for the particular case of sampled surfaces
embedded in $R^3$, the k-D tree displays undesirable near-worst-case
behavior for queries that are not "close" to a point in the object
set. The structure described in this paper, the face k-D tree,
overcomes this drawback of conventional k-D trees by taking advantage
of the a-priori knowledge that the object being queried is a surface.
The correction comes at the expense of introducing a strictly bounded
maximum potential error into the query results. A desirable
side-effect is also demonstrated: if a triangulation is known for the
original object surface, reduced complexity triangulations suitable
for multi-resolution visualization can be produced.

2. k-D Trees Over Surfaces

In order to understand the shortcomings of the k-D tree as a query
structure for surface data, it is important to understand how a k-D
tree is constructed and queried.

The most basic form of k-D tree subdivides space in the following
manner: First, one of the principal coordinate axes (x, y or z) is
chosen as the normal direction for the splitting plane. Second, the
median point, $p_m$, with respect to this dimension is determined;
$p_m$ can be found in linear time. The plane $P$ defined by $p_m$ and
the selected coordinate axis is used to split the space into two
halfspaces, each containing an equal number of points. It takes
constant time to determine which halfspace a point is in, so this
split can also be accomplished in linear time. The same type of
subdivision is recursively applied to the two halfspaces until the
number of points in the halfspace reaches some constant minimum.

In order to find the nearest neighbor to a given
query point q the following algorithm is used.

Given a k-D tree T and a query point q

point nearest_neighbor(T, q)
    if T is a leaf node
        find the nearest neighbor n in T via brute force
        return n
    else
        get P, the partition plane at T
        D := distance(P, q)

        // determine which side of P the point q is on
        side := which_side(P, q)    // "left" or "right"

        // find the nearest neighbor on this side
        n := nearest_neighbor(T.side, q)
        if (distance(n, q) < D)
            // no need to search the other child
            return n
        else
            // must search the other child
            m := nearest_neighbor(T.~side, q)
            if (distance(q, m) < distance(q, n))
                return m
            else
                return n
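
The construction and query described above can be written out as a small runnable sketch. This is an illustration under stated assumptions rather than the code used for the experiments: the splitting axis simply cycles through x, y and z, the median is found by sorting (O(n log n) rather than the linear-time selection mentioned in the text), the leaf size of 4 is arbitrary, and the point set is assumed to be non-empty.

```python
import math

LEAF_SIZE = 4  # hypothetical "constant minimum" number of points per leaf

def build_kd_tree(points, depth=0):
    """Recursively split a non-empty list of 3-D points at the median."""
    if len(points) <= LEAF_SIZE:
        return {"leaf": list(points)}
    axis = depth % 3                      # cycle through x, y, z
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {"axis": axis,
            "split": ordered[mid][axis],  # position of the partition plane
            "left": build_kd_tree(ordered[:mid], depth + 1),
            "right": build_kd_tree(ordered[mid:], depth + 1)}

def nearest_neighbor(node, q):
    """Nearest neighbor of q, following the pseudocode above."""
    if "leaf" in node:
        return min(node["leaf"], key=lambda p: math.dist(p, q))
    axis, split = node["axis"], node["split"]
    plane_distance = abs(q[axis] - split)            # D in the pseudocode
    if q[axis] < split:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    n = nearest_neighbor(near, q)
    if math.dist(n, q) < plane_distance:
        return n                                     # other child cannot win
    m = nearest_neighbor(far, q)                     # must search other child
    return m if math.dist(q, m) < math.dist(q, n) else n
```

A typical use would be `tree = build_kd_tree(points)` followed by repeated calls to `nearest_neighbor(tree, q)`; the result should agree with a brute-force scan of the point list.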


The worst case running time of this search algorithm in $R^3$ is
$O(N^{2/3})$ [4]. This type of behavior occurs when the distance D to
the partition plane is frequently less than the distance to the
nearest neighbor point in the object.

When the object points are clustered on a surface, query slowdown
becomes evident over a large region of space. The graph in fig. 1
shows the performance of k-D tree nearest neighbor queries for two
point sets. The first set is a uniform random distribution of points
over the cubic region [0..100, 0..100, 0..100]. The second set
randomly distributes the x and y-coordinates over the same cubic
region, but constrains the z-coordinate to lie within the range
z = [49.5..50.5], simulating a sampled planar surface. For each set,
10,000 queries were performed. The x and y-coordinates of the query
points were randomly distributed, and the z-coordinate was evenly
distributed over the range [0..100]. The average query cost is plotted
for each value of z. It is clear that the query performance for the
nearly planar point set approaches the performance on the random set
only in a narrow band proximal to the actual z-range of the points.
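
A scaled-down version of this experiment can be sketched as follows. It is only an approximation of the setup described above: query cost is measured as the number of tree nodes visited (the text does not define the cost unit), the tree uses cycling axes and a hypothetical leaf size of 4, and the number of points and trials is reduced.

```python
import math
import random

def build(points, depth=0, leaf_size=4):
    """Median-split k-D tree over 3-D points, stored as nested tuples."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % 3
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return ("node", axis, ordered[mid][axis],
            build(ordered[:mid], depth + 1, leaf_size),
            build(ordered[mid:], depth + 1, leaf_size))

def query(tree, q, counter):
    """Nearest-neighbor search that counts every visited node in counter[0]."""
    counter[0] += 1
    if tree[0] == "leaf":
        return min(tree[1], key=lambda p: math.dist(p, q))
    _, axis, split, left, right = tree
    near, far = (left, right) if q[axis] < split else (right, left)
    best = query(near, q, counter)
    if math.dist(best, q) < abs(q[axis] - split):
        return best
    other = query(far, q, counter)
    return other if math.dist(other, q) < math.dist(best, q) else best

def average_cost(tree, z, rng, trials=200):
    """Mean number of visited nodes for random (x, y) queries at height z."""
    counter = [0]
    for _ in range(trials):
        query(tree, (rng.uniform(0, 100), rng.uniform(0, 100), z), counter)
    return counter[0] / trials

def run_experiment(seed=0, n_points=1000):
    rng = random.Random(seed)
    planar = [(rng.uniform(0, 100), rng.uniform(0, 100),
               rng.uniform(49.5, 50.5)) for _ in range(n_points)]
    tree = build(planar)
    return {"far": average_cost(tree, 5.0, rng),    # well away from the slab
            "near": average_cost(tree, 50.0, rng)}  # inside the slab
```

Queries far from the slab should visit many more nodes on average than queries inside it, in line with the behavior described for the nearly planar point set.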


Figure 1 : Query performance with respect to position on randomly
distributed and planar data. (Plot: curves for the point sets
"1000_points_random", "5000_points_random", "10000_points_random" and
"1000_points_near_planar", plotted against Query Point Z Value from 0
to 100.)


This behavior can be explained by reasoning about the Voronoi diagram
of a planar set of points embedded in $R^3$. The convex subdivision
induced on $R^3$ by the Voronoi diagram can be viewed as a "perfect"
k-D tree. It is never necessary to check the "outside" side of a
dividing plane during a nearest neighbor search as it is in the
heuristic k-D tree structure. Let us assume a planar point set $S$ in
plane $P$, and its Voronoi diagram $V$ in $R^3$. This relationship is
depicted in fig. 2.

The planar boundaries that comprise the Voronoi tessellation are
projections of the lines of the 2-D diagram in the plane parallel to
the direction of the surface normal $N$ of $P$. Assume that the k-D
tree query algorithm is provided with the exact Voronoi diagram as a
query structure. Even with this advantage, queries relatively distant
from the plane will be closer to most of the walls of the cell than to
the point enclosed by the cell, so the same pathological behavior will
be manifested even given a perfect space decomposition. However, all
hope is not lost for using k-D trees for point location on surfaces.
If it is assumed that the query object is a surface, it is possible to
build a special type of k-D tree data structure that avoids the
degeneracies of conventional k-D trees: the face k-D tree.

Figure 2 : A planar Voronoi diagram and its
extension into $R^3$


3.  Error  Bounded  Query  Structures 

Suppose it is known that a query object is a surface. Surfaces
embedded in $R^3$ have the property that in local neighborhoods, they
resemble $R^2$, the plane. It is known that for planar point
distributions, Voronoi diagrams can be constructed in provably optimal
$O(N \log N)$ time and queried in provably optimal $O(\log N)$ time
[4].

Consider a set of N points in R^3 distributed within ±ε of a
plane P. Parallel project these points onto P and construct
the 2-D Voronoi diagram V in P. General R^3 nearest-neighbor
queries can be made by parallel projecting the query point Q
onto P and performing a 2-D query on V. This query returns an
approximate nearest neighbor. It can be shown that this
approximate nearest neighbor is less than 2ε more distant from
the query point Q than the true nearest neighbor.
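
The bound can be checked numerically. The sketch below is an illustration only, not the authors' implementation: a brute-force search over the projected points stands in for the 2-D Voronoi query, and the function names and random test data are assumptions. The bound holds because the projected distance to the returned point is no larger than that to the true nearest neighbor, while the offsets of the two points from P differ by at most 2ε.

```python
import math
import random

def project(p, origin, normal):
    """Parallel-project 3-D point p onto the plane (origin, unit normal)."""
    h = sum((p[i] - origin[i]) * normal[i] for i in range(3))
    return tuple(p[i] - h * normal[i] for i in range(3))

def approx_nearest(points, q, origin, normal):
    """Nearest point judged only by the distance between projections.

    Brute force stands in for the planar Voronoi query.
    """
    qp = project(q, origin, normal)
    return min(points,
               key=lambda p: math.dist(project(p, origin, normal), qp))

def exact_nearest(points, q):
    """True 3-D nearest neighbor by brute force."""
    return min(points, key=lambda p: math.dist(p, q))

def worst_excess(eps, n_points=200, n_queries=200, seed=0):
    """Largest observed (approximate distance - true distance)."""
    rng = random.Random(seed)
    origin, normal = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
    # Points scattered within +/- eps of the plane z = 0.
    points = [(rng.uniform(0, 100), rng.uniform(0, 100),
               rng.uniform(-eps, eps)) for _ in range(n_points)]
    worst = 0.0
    for _ in range(n_queries):
        q = (rng.uniform(0, 100), rng.uniform(0, 100),
             rng.uniform(-50, 50))
        d_approx = math.dist(q, approx_nearest(points, q, origin, normal))
        d_exact = math.dist(q, exact_nearest(points, q))
        worst = max(worst, d_approx - d_exact)
    return worst
```

Calling worst_excess(0.5) returns the largest overshoot seen in the random trial; under the argument above it should never exceed 2 * 0.5.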

When  dealing  with  the  entire  surface,  this  type  of 
local  approximate  nearest-neighbor  map  is  useful 
only  in  sub-regions  of  the  surface  that  can  be 
closely  approximated  by  a  plane.  This  raises  the 
issue  of  how  the  surface  should  be  sub-divided. 

3.1  Subdividing  The  Object  Space 

The approach taken to subdividing the surface is similar to
the technique used for creating a k-D tree using arbitrary
partition planes. In this respect, the algorithm shares some
common traits with the work of Sproull [7]. Starting with the
whole unpartitioned object, the eigenvalues and eigenvectors
of the covariance matrix of the object points are computed.
The plane P_min, defined by the eigenvector corresponding to
the minimum eigenvalue and the center of mass of the points,
is a planar approximation of the distribution of the object
points in the least-squares sense. The distance from each
point to P_min is then measured and the maximum recorded. If
the maximum distance exceeds ε, then the object is split by
the plane defined by the eigenvector corresponding to the
maximum eigenvalue, V_max, and the median of the points with
respect to the ordering induced by the direction of V_max.
The subdivision continues recursively on the two equally
sized subsets, which will comprise the left and right
subtrees of the current node.

This  subdivision  process  halts  when  all  points 
within  the  current  convex  cell  lie  within  epsilon  of 
the  plane  or  the  number  of  points  is  less  than  a 
constant  threshold.  If  the  count  threshold  is 
reached  before  the  epsilon  criterion  is  fulfilled,  the 
points  are  placed  into  a  “bucket”  that  will  be 
searched  by  brute  force  by  the  query  algorithm 
when this leaf is encountered. However, if the epsilon
criterion is fulfilled first, the points are parallel-
projected onto the plane P_min and a 2-D Voronoi
diagram  is  constructed  as  the  query  structure  for 
this  leaf. 

Constructing the 2-D Voronoi diagrams at the leaves costs at
most O(N log N) time and requires O(N) space. Therefore the
construction of this hybrid structure is no more costly than
the construction of a conventional k-D tree. The following
algorithm details the process for constructing a face k-D
tree more concisely.


face_kd Build_Tree(num_points, points)

    if (num_points < threshold)
        return(bucket_leaf(points))
    create covariance matrix
    compute center of mass {COM}
    compute eigenvectors {Vmin, Vmid, Vmax}

    D = max dist. from plane(Vmin, COM, points)
    if (D < ε)
        return(voronoi_leaf(Vmin, COM, points))
    else
        split points along Vmax {left, right}
        left_child  = Build_Tree(num_points/2, left)
        right_child = Build_Tree(num_points/2, right)
        return interior_node(left_child, right_child)
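
A runnable sketch of the same recursion follows. It is an illustration under stated assumptions rather than the authors' code: the eigenvectors are found by simple power iteration (a stand-in for a proper symmetric eigensolver), a Voronoi leaf merely stores its points, plane normal, and center of mass instead of building an actual Voronoi diagram, and the dictionary layout and function names are hypothetical.

```python
import math

def centroid(pts):
    n = len(pts)
    return [sum(p[i] for p in pts) / n for i in range(3)]

def covariance(pts, com):
    n = len(pts)
    return [[sum((p[i] - com[i]) * (p[j] - com[j]) for p in pts) / n
             for j in range(3)] for i in range(3)]

def dominant_eigenvector(m, iters=200):
    """Power iteration on a symmetric positive semi-definite 3x3 matrix."""
    v = [2 / 7, 3 / 7, 6 / 7]          # arbitrary unit start vector
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                # zero matrix: any direction will do
            break
        v = [x / norm for x in w]
    return v

def build_tree(points, eps, threshold=10):
    """Recursive face k-D construction; threshold must be at least 2."""
    if len(points) < threshold:
        return {"kind": "bucket", "points": points}
    com = centroid(points)
    cov = covariance(points, com)
    trace = cov[0][0] + cov[1][1] + cov[2][2]
    # trace*I - cov has its largest eigenvalue where cov has its smallest.
    shifted = [[(trace if i == j else 0.0) - cov[i][j] for j in range(3)]
               for i in range(3)]
    v_min = dominant_eigenvector(shifted)
    v_max = dominant_eigenvector(cov)

    def offset(p, v):
        return sum((p[i] - com[i]) * v[i] for i in range(3))

    d = max(abs(offset(p, v_min)) for p in points)
    if d < eps:
        return {"kind": "voronoi", "normal": v_min, "com": com,
                "points": points}
    # Median split along the direction of greatest spread.
    ordered = sorted(points, key=lambda p: offset(p, v_max))
    mid = len(ordered) // 2
    return {"kind": "interior",
            "left": build_tree(ordered[:mid], eps, threshold),
            "right": build_tree(ordered[mid:], eps, threshold)}

def count_points(node):
    """Total number of points stored in the leaves below node."""
    if node["kind"] == "interior":
        return count_points(node["left"]) + count_points(node["right"])
    return len(node["points"])
```

A flat patch collapses into a single Voronoi leaf, while a folded surface is split until each piece is planar to within eps or small enough for a bucket.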


3.2  Querying  Face  k-D  Trees 

The sole difference between the query algorithm for the face
k-D tree and the conventional k-D query algorithm described
previously occurs when a Voronoi leaf is encountered. When the
query reaches such a leaf, the query point Q is parallel
projected onto the plane defined by V_min and the center of
mass of the leaf. A planar nearest-neighbor query is then
performed on the Voronoi diagram V of this leaf. The point P
returned by this query is known to be at most 2ε more distant
from Q than any other point stored at this leaf. The result
point P is referred to as an approximate nearest neighbor to
Q.

Since  the  distance  to  the  approximate  nearest 
neighbor  is  always  greater  than  or  equal  to  the  dis¬ 
tance  to  the  true  nearest  neighbor,  the  query  algo¬ 
rithm  will  visit  a  superset  of  nodes  that  would  have 
been  visited  had  the  exact  nearest  neighbor  been 
found  for  each  leaf.  Therefore  it  can  be  concluded 
that the distance from the query point to its approximate
nearest neighbor is less than 2ε greater than
the  distance  to  the  true  nearest  neighbor. 

3.3  Performance  Gains 

As was observed earlier, conventional k-D trees built over
surfaces perform poorly when query points fall outside of a
narrow range R of space that lies within some distance of the
surface. A face k-D tree constructed for a fixed ε will be
more efficient for queries made outside of R because the size
of the query structure has been reduced, making the worst-case
behavior less likely and less costly when it does occur.
However, the hybrid structure produces inaccurate query
results for points close to the surface. This is not a serious
problem because, although the result produced by the hybrid
tree is inaccurate, it is accurate enough to determine with
high probability that the query point lies within R. Since
querying the conventional k-D tree is efficient within R, it
is acceptable to perform exact nearest neighbor queries for
those points determined to be close to the surface by less
expensive approximate queries.

By  using  exact  and  approximate  query  trees  in  con¬ 
junction,  it  is  possible  to  retain  accuracy  where 
important  and  to  make  queries  fast  and  less  accu¬ 
rate  when  a  good  estimate  will  suffice.  These  traits 
are  especially  important  for  shape  registration  and 
collision  detection. 


3.4  Relative  Error  Bounded  Queries 

It is desirable to maintain a statistical bound on the amount
of error in the nearest neighbor calculations returned by the
query structure. In this way it is possible for the user of
the structure to specify the desired error relative to the
distance from the query point to its nearest neighbor. It is
possible to limit the percentage error by setting a threshold
for the relative error of approximate queries:


    E_r = 2ε / (dist(query point, nearest neighbor) - 2ε)

If the threshold is exceeded, the approximate query is
discarded and an exact query is made. The graph in fig. 4
demonstrates the performance of a hybrid query tree on the
same 5000 point planar data set used in fig. 1. The ε used in
all cases was 5.0; error limits of 10%, 5% and 2% are
displayed, as well as the cost of using the conventional k-D
tree query structure.

To further optimize query time, a series of face k-D trees can
be constructed with varying values of ε. Consider a series of
trees {T_1..T_k} constructed at ε values {ε_1..ε_k}
(ε_1 > ε_2 > ... > ε_k). Queries are performed by first
obtaining an approximate nearest neighbor distance D from T_1.
It is known that D is at most 2ε_1 greater than the true
distance. If E_r is less than the desired bound, the current
point is returned. If E_r is greater than the desired bound, a
higher resolution tree T_i, i > 1, is selected such that E_r
will fall within the bound. The query is then repeated on tree
T_i. An empirical example of the performance gain is shown in
fig. 5. A series of 6 face k-D trees was constructed for a
76000 point range image of a human subject at resolutions
ε = {5.0, 4.0, 3.0, 2.0, 1.0, 0.5}. A maximum query error of
2% was permitted. Query cost is plotted vs. distance from the
surface for a series of 100,000 random queries made within a
distance of 200 units from the surface. The head dataset is
itself approximately 200 units in size, so the queries fall
within one object extent of the surface.
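
The tree-selection step can be sketched as follows. This is one possible reading, not the authors' procedure: it assumes the coarse distance D overshoots the true distance by at most 2ε_1, so that D - 2ε_1 is a lower bound on the true distance, and it bounds the relative error of tree i conservatively by 2ε_i / (D - 2ε_1). The function name and the convention of returning None to request an exact query are hypothetical.

```python
def pick_tree(d_coarse, eps_series, max_rel_error):
    """Choose the coarsest tree whose relative error bound is acceptable.

    d_coarse   -- approximate distance returned by the coarsest tree
    eps_series -- epsilon values, largest (coarsest) first
    Returns an index into eps_series, or None if an exact query is needed.
    """
    # The true distance is at least this large (assumed 2*eps overshoot).
    lower = d_coarse - 2.0 * eps_series[0]
    if lower <= 0.0:
        return None                    # too close to the surface to bound
    for i, eps in enumerate(eps_series):
        if 2.0 * eps / lower <= max_rel_error:
            return i
    return None

series = [5.0, 4.0, 3.0, 2.0, 1.0, 0.5]
far = pick_tree(1000.0, series, 0.02)   # coarsest tree already suffices
mid = pick_tree(260.0, series, 0.02)    # a finer tree is required
near = pick_tree(30.0, series, 0.02)    # falls back to an exact query
```

With the six resolutions quoted above and a 2% limit, distant queries are answered by the coarsest tree, and only queries near the surface fall through to the exact structure.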

The absolute upper bound of 2ε is a pessimistic assumption.
The true error of most queries will be much lower than 2ε, as
this bound is derived from a degenerate geometric condition
that requires a high local variance of the surface that would
violate most smoothness or continuity assumptions. Work is
underway to tighten the bound using the signed distance Dp
from the query result point to the plane onto which it was
projected at a Voronoi leaf. A straightforward reduction can
be made if Dp is negative (meaning that the nearest neighbor
point is closer to the query point than its projection). In
this case, a reduction of the bound from 2ε to simply ε
is possible.

Figure 4: Comparison of query times using face k-D tree and
conventional k-D tree on a planar distribution of points.

The two examples shown in the graphs cannot be interpreted
alone as an indicator of performance improvement. However,
similar speedups have been observed with empirical data in
registration tasks. The performance gain is directly related
to how much the surface can be simplified under a given
epsilon bound. This simplification can be measured by counting
the number of nodes in an approximate query tree for the given
value of epsilon. Table 1 illustrates some simplification
results on empirical data sets. The head set is a Cyberware
scan of a human head provided by the CARD lab at Wright
Patterson AFB. The lung and brain sets come from the radiology
department of the Johns Hopkins University. These sets are
normalized to fit within a 256x256x256 region by a uniform
scaling. A bucket size of 10 is used for non-Voronoi buckets.

Table 1: Tree Size (nodes) vs. ε

    ε           Head     Lung    Brain
    8.0          108       66       12
    4.0          279      154       62
    2.0          663      464      179
    1.0         1638     1170      255
    0.0         7600     1622      648
    # points   76000    16216     6472

The original motivation for this research was the acceleration
of object registration. The face k-D representation is
undergoing systematic evaluation under query loads from real
registration problems. It is expected that these tests will
provide a good gauge for real-world performance.

Figure 5: Cost for uniformly distributed queries vs. distance
on empirical data using standard k-D trees and a 6 tree face
k-D series. Maximum allowable error for queries was 2%.
(Axis label: Distance From Surface.)


4.  Variable  Level  of  Detail  Surfaces 

An  interesting  side-effect  of  the  hybrid  k-D  tree 
representation  is  that  it  is  possible  to  produce  sur¬ 
face  representations  at  varying  levels  of  detail 
directly from the tree in O(N) time if an initial surface
triangulation is provided with the surface
point  set.  It  is  the  case  that  the  majority  of  3-D 
point  sets  of  surfaces  are  triangulated  prior  to  use 
in  modeling  or  registration  applications. 

Level  of  detail  (LOD)  reduction  is  important  to  real 
time  modeling  and  simulation  because  rendering 
and  other  costs  can  be  reduced  by  reducing  detail 
for  objects  that  are  not  currently  a  focus  of  atten¬ 
tion  for  the  system. 

As  a  visualization  tool,  face  k-D  trees  function 
much  like  another  face  decomposition  from  the 
graphics  literature,  face  octrees  [5].  The  construc¬ 
tions  are  similar,  but  the  rectilinear  constraints 
imposed  by  the  octree  structure  limit  its  ability  to 
adapt  quickly  to  local  variation.  The  face  k-D  tree 
is  useful  for  LOD  reduction  because  it  simplifies 
low  detail  flat  areas  of  the  surface  into  single  leaves 
while  maintaining  finer  subdivisions  in  regions 

A brief description of the simplification algorithm is as
follows: First, the set of centers of mass (COMs) of the leaf
nodes is re-triangulated based on the neighbor connectivity
of the original triangulation. Logically, all of the vertices
within a single leaf node are merged to a single vertex
located at the COM. If a leaf node contains points that are
not a connected, planar sub-graph of the original
triangulation, the leaf is split until these criteria are
met. There is no assurance of topology preservation in the
current implementation, but work is progressing on
integration of the reduced point set produced by the face k-D
tree subdivision with existing triangulation techniques.

Figure 6: Decreasing level of detail images of a human head
(data courtesy USAF CARD Lab).

5.  Conclusions  and  Future  Research 

This paper presents a set of preliminary results for
applications of the face k-D tree decomposition method. It is
shown how this structure provides performance superior to that
of conventional k-D trees for point location/distance queries
over surfaces embedded in R^3. For a set of N points, the cost
of constructing the face k-D tree is O(N log N) time using
O(N) space. It is shown how a face k-D tree can be used in
conjunction with a conventional k-D tree to allow user
selection of the maximum relative error allowed in any point
location query.

The research is still in its early stages and the performance
increases demonstrated here are purely empirical. A formal
statistical justification for the observed performance gain is
forthcoming, as are several geometrically motivated
improvements that
will reduce query times.

Abstract

It has long been recognized that, in principle, object
recognition problems can be solved by simple,
brute  force  methods.  However,  the  approach  has  gen¬ 
erally  been  held  to  be  completely  impractical.  We 
argue  that  by  combining  a  few  more  or  less  standard 
tricks  with  computational  resources  that  are  histor¬ 
ically  large,  but  completely  feasible  by  recent  stan¬ 
dards,  dramatic  results  can  be  achieved  for  a  number 
of  recognition  problems.  In  particular,  we  describe  a 
resource-intensive,  appearance-based  method  that  uti¬ 
lizes  intermediate-level  features  to  provide  normalized 
keys  into  a  large,  memorized  feature  database,  and 
Bayesian  evidence  combination  coupled  with  a  Hough¬ 
like  indexing  scheme  to  assemble  object  hypotheses 
from  the  memory.  This  system  demonstrates  robust 
recognition  of  a  variety  of  3-D  shapes,  ranging  from 
sports  cars  and  fighter  planes  to  snakes  and  lizards 
over  full  spherical  or  hemispherical  ranges  (and  pla¬ 
nar  scale,  translation  and  rotation).  We  report  the 
results  of  various  large-scale  performance  tests,  in¬ 
volving,  altogether,  over  2000  separate  test  images. 
These  include  performance  scaling  with  database  size, 
robustness against clutter, and generic ability. The
result  of  97%  forced  choice  accuracy  with  full  ortho¬ 
graphic  invariance  for  24  complex  curved  3-D  objects 
over  full  viewing  spheres  or  hemispheres  is  the  best  we 
are  aware  of  for  this  type  of  problem. 

Key Words: Object recognition, appearance-based
representations, visual learning.

1  Introduction 

Object  recognition  and  indexing  problems  have 
been  some  of  the  most  intensely  studied  in  the  field 
of  machine  vision.  Until  recently,  however,  recogni¬ 
tion  systems,  especially  three  dimensional  ones,  were 
quite  limited  in  their  abilities;  both  in  the  types  of  ob¬ 
jects  they  could  handle,  and  in  the  conditions  under 
which  the  methods  would  work.  It  has  been  recognized 
from  the  beginning  that,  in  principle,  object  recogni¬ 
tion  can  be  solved  using  a  brute-force  approach:  just 
compare  “all”  possible  appearances  of  an  object  di¬ 
rectly  against  an  image.  It  can  even  be  argued  that 


the  complexity  of  such  algorithms  is  linear  in  the  num¬ 
ber  of  objects.  However,  the  constant  factors  in  this 
approach are so large (e.g. 10^22 operations per rigid
object using no smarts = 10,000 pixels x 100 intervals
per DOF for 6 rigid and 3 lighting freedoms), that the
approach  was  dismissed  as  completely  infeasible,  and 
work  concentrated  on  the  development  of  efficient  al¬ 
gorithms.  Much  has  been  learned  from  this  work,  but 
what  has  not  emerged  are  algorithms  that  are  efficient 
enough  to  solve  object  recognition  problems  in  any  but 
the  most  limited  contexts,  with  the  sort  of  computer 
power  available  on,  say,  a  1990  desktop. 
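
The brute-force figure can be reproduced from the quantities quoted in the parenthesis above. The function below is only worked arithmetic; its name and parameter names are illustrative.

```python
def brute_force_ops(pixels=10_000, intervals_per_dof=100,
                    rigid_dof=6, lighting_dof=3):
    """Pixel comparisons needed to test every sampled appearance of one object."""
    # 100 intervals on each of 9 degrees of freedom gives 100**9 appearances.
    appearances = intervals_per_dof ** (rigid_dof + lighting_dof)
    return pixels * appearances

ops = brute_force_ops()   # 10**4 * 10**18
```

The product is 10^4 x 10^18 = 10^22 operations per object, which is the scale that led to the approach being dismissed.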

One  possible  conclusion  is  that  such  efficient  recog¬ 
nition  algorithms  do  not  exist.  Though  there  could 
be  some  undiscovered  technique  that  will  dramatically 
improve  the  situation,  this  seems  unlikely,  given  the  ef¬ 
fort  focussed  on  the  problem  in  the  last  three  decades. 
Nor  is  there  particular  evidence  for  the  existence  of 
such  efficient  algorithms  in  the  human  brain.  Granted, 
it  performs  certain  recognition  tasks  extremely  well, 
but  estimates  of  the  computational  resources  it  would 
take  to  simulate  the  relevant  neural  processes  range 
from 1 to 1000+ tera-ops, with 1 to 1000 terabytes
of stored information. Such estimates are extremely
uncertain, but they are all large compared to a 1990
desktop, though small compared to 10^22.

In  the  field  of  computer  science,  the  most  dramatic 
change  has  been  the  phenomenal  increase  in  the  com¬ 
putational  power  of  the  machines  which,  for  a  fixed 
price, has doubled about every 18 months for three
decades.  This  phenomenon,  sometimes  referred  to  as 
Moore’s  Law,  has  repeatedly  exceeded  all  expecta¬ 
tions,  and  continues  to  the  current  date,  with  only 
a  few  signs  of  abatement. 
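
As a quick check of the scale involved, doubling every 18 months for three decades amounts to 20 doublings. The helper names below are illustrative only.

```python
def doublings(years, months_per_doubling=18):
    """Number of doublings that fit into the given span."""
    return years * 12 / months_per_doubling

def growth_factor(years, months_per_doubling=18):
    """Overall multiplicative increase over the span."""
    return 2 ** doublings(years, months_per_doubling)

factor = growth_factor(30)   # 2**20
```

Twenty doublings give a factor of 2^20, or roughly one million.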

The  resulting  million-fold  increase  has  made  cer¬ 
tain  algorithms,  once  thought  impractical,  practical  to 
run.  In  the  case  of  recognition,  a  question  that  has  not 
been  adequately  addressed  experimentally  is  whether 
near-term  increases  in  computational  resources  cou¬ 
pled  with  modest  algorithmic  “smarts”  can  achieve, 
for  certain  problems,  what  algorithmic  improvement 
alone  has  failed  to  do,  making  recognition,  in  one 
sense, a relatively “easy” problem. We think the answer
to this question is yes, and furthermore that a
threshold  that  allows  interesting  results  to  be  obtained 
from  such  experiments  without  exclusive  use  of  a  su¬ 
percomputer  has  recently  been  crossed. 

In  support  of  the  position  that  resource  intensive 
methods  are  worth  looking  at  closely  with  today’s 
power,  we  present  some  results  for  an  appearance- 
based  system  which  we  believe  represent  the  best  3- 
D  recognition  results  reported  anywhere  for  general 
rigid  objects.  The  system  uses  only  a  modest  amount 
of  force  (relatively  speaking)  and  only  a  little  of  the 
available  algorithmic  cleverness,  yet  the  results  sug¬ 
gest  not  only  scalable  general  rigid  object  recognition, 
but  good  performance  in  the  presence  of  clutter  and 
some  surprising  generic  ability. 

2  Background 

The most successful visual recognition work to date has used
model-based systems. Notable recent examples are
[11, 10, 9, 6]. The 3D geometric models on which these systems
are based are both their strength and their weakness [8, 7].
On the one hand, explicit models provide a framework that
allows powerful geometric constraints to be utilized to good
effect. On the other, model schemas are generally severely
limited in the sort of objects that they can represent, and
obtaining the models is typically a difficult and
time-consuming process. There has been a fair amount of work
on automatic acquisition of geometric models, mostly with
range sensors, e.g., [17, 19, 2], but also visually, for
various representations [20, 3, 1, 5]. However, these
techniques are limited to a particular geometric schema, and
even within their domain, especially with visual techniques,
their performance is often unsatisfactory.

Appearance-based  object  recognition  methods  are 
resource-intensive  algorithms  that  have  been  proposed 
in  order  to  make  recognition  systems  more  general, 
and  more  easily  trainable  from  visual  data.  Most  of 
them  essentially  operate  by  comparing  an  image-like 
representation  of  object  appearance  against  many  pro¬ 
totype  representations  stored  in  a  memory,  and  finding 
the  closest  match.  They  have  the  advantage  of  being 
fairly  general,  and  often  easily  trainable.  In  recent 
work,  Poggio  has  recognized  wire  objects  and  faces 
[15,  4].  Rao  and  Ballard  [16]  describe  an  approach 
based on the memorization of the responses of a set
of  steerable  filters.  Mel  [12]  takes  a  somewhat  similar 
approach  using  a  database  of  stored  feature  vectors 
representing multiple low-level cues. Murase and Nayar
[13] find the major principal components of an
image  dataset,  and  use  the  projections  of  unknown 
images  onto  these  as  indices  into  a  recognition  mem¬ 
ory.  Schmid  and  Mohr  [18]  have  recently  reported 
good  results  for  an  appearance  based  system  with  a 
local-feature  approach  similar  in  spirit  to  what  we  use, 
though  with  different  features  and  a  much  simpler  ev¬ 
idence  combination  scheme. 

In  general,  appearance-based  methods  have  proven 
to  be  a  useful  technique;  however  because  matches 
are  generally  made  to  representations  of  complete  ob¬ 
jects,  these  methods  tend  to  be  more  sensitive  to  clut¬ 
ter  and  occlusion  than  is  desirable,  and  require  good 
global  segmentation  for  success.  Hough  transform 


and  other  voting  methods  allow  evidence  from  discon¬ 
nected  parts  to  be  effectively  combined,  but  the  size 
of  the  voting  space  increases  exponentially  with  the 
number  of  degrees  of  visual  freedom.  This  makes  it 
difficult  to  apply  such  techniques  directly  when  more 
than  about  3  DOF  are  involved,  thus  limiting  the  use 
of  the  technique  for  3D  object  recognition,  which  gen¬ 
erally  involves  at  least  6  DOF. 

We  have  implemented  a  prototype  system  that,  by 
combining  a  large  appearance  database  of  semi-local, 
intermediate-level  key  features  with  a  Hough-like  evi¬ 
dence  combination  technique,  resolves  both  the  clutter 
and  occlusion  sensitivity  of  traditional  memory-based 
methods,  and  the  space  problems  of  voting  methods 
for  high  DOF  problems.  This  system  demonstrates 
robust  recognition  of  a  variety  of  3-D  shapes,  rang¬ 
ing  from  sports  cars  and  fighter  planes  to  snakes  and 
lizards  over  full  spherical  or  hemispherical  ranges  (and 
planar  scale,  translation  and  rotation).  It  is  also  ro¬ 
bust  against  clutter,  and  demonstrates  some  generic 
ability.  This  is  in  contrast  to  some  recent  results  e.g. 
Murase  and  Nayar  [13]  where  essentially  only  one  of 
the  two  out-of-plane  rotational  degrees  of  freedom  is 
spanned,  and  clutter  is  a  significant  problem. 

3  The  Method 
3.1  Overview 

The  basic  notion  is  to  represent  the  visual  appear¬ 
ance  of  an  object  as  a  structured  combination  of  a 
number of semi-local features, or fragments. The idea is
that, under different conditions (e.g. lighting, background,
changes in orientation etc.), the feature extraction process
will find some of these, but in general not all of them.
However, we show that the fraction that is found by feature
extraction processes is frequently sufficient to identify
objects in the scene.
This addresses one of the principal problems of object
recognition,  which  is  that,  in  any  but  rather  artificial 
conditions,  it  has  so  far  proved  impossible  to  reliably 
segment  whole  objects  on  a  bottom-up  basis.  In  this 
paper,  local  features  based  on  automatically  extracted 
boundary  fragments  are  used  to  represent  multiple  2- 
D  views  of  rigid  3-D  objects,  but  the  basic  idea  could 
be  applied  to  other  features  and  other  representations. 

In  more  detail,  we  make  use  of  semi-invariant  local 
objects  we  call  keys.  A  key  is  any  robustly  extractable 
part  or  feature  that  has  sufficient  information  content 
to  specify  a  configuration  of  an  associated  object  plus 
enough  additional  parameters  to  provide  efficient  in¬ 
dexing  and  meaningful  verification.  The  basic  idea  is 
to  utilize  a  database  (here  viewed  as  an  associative 
memory)  organized  so  that  access  via  a  key  feature 
evokes  associated  hypotheses  for  the  identity  and  con¬ 
figuration  of  all  objects  that  could  have  produced  it. 
These hypotheses are fed into a second stage associative
memory, keyed by the configuration, which maintains
a probabilistic estimate of the likelihood of each
hypothesis  based  on  statistics  about  the  occurrence  of 
the  keys  in  the  primary  database.  In  our  case,  since 
3-D  objects  are  represented  by  a  set  of  views,  the  con¬ 
figurations  represent  two  dimensional  transforms.  Ef¬ 
ficient  access  to  the  associative  memories  is  achieved 
using a hashing scheme.

One step that we do not take in the current system
is  whole-object  verification  of  the  top  hypotheses.  Un¬ 
like  appearance-based  systems  based  on  whole-object 
appearance,  the  structure  of  our  representation  is  such 
that  this  could  be  performed  to  advantage,  and  such  a 
step  has  the  potential  to  significantly  improve  the  per¬ 
formance  of  the  system  as  a  whole.  The  results  given 
should  thus  be  interpreted  as  representing  the  power 
of  an  initial  hypothesis  generator  or  indexing  system. 

3.2  Key  Features 

The  recognition  technique  is  based  on  the  assump¬ 
tion  that  robustly  extractable,  semi-invariant  keys  can 
be efficiently recovered from image data. More specifically,
the keys must possess the following characteristics.
First, they must be complex enough not only
to  specify  the  configuration  of  the  object,  but  to  have 
parameters  left  over  that  can  be  used  for  indexing. 
Second,  the  keys  must  have  a  substantial  probability 
of  detection  if  the  object  containing  them  occupies  the 
region  of  interest  (robustness).  Third,  the  index  pa¬ 
rameters  must  change  relatively  slowly  as  the  object 
configuration  changes  (semi-invariance). 

We  currently  make  use  of  a  single  key  feature  type 
consisting  of  curve  orientation  templates  normalized 
by  robust  boundary  fragments.  We  call  these  features 
curve  patches.  Specifically,  a  curve-finding  algorithm 
is  run  on  an  image,  producing  a  set  of  segmented  con¬ 
tour  fragments  broken  at  points  of  high  curvature. 
The  longest  curves  are  selected  as  key  curves,  and  a 
fixed-size  template  (21  x  21)  constructed  with  a  base 
segment  determined  by  the  endpoints  (or  the  diameter 
in  the  case  of  closed  or  nearly  closed  curves)  of  the  key 
curve  occupying  a  canonical  position  in  the  template. 
All  image  curves  that  intersect  the  normalized  tem¬ 
plate  are  mapped  into  it  with  a  code  specifying  their 
orientation  relative  to  the  base  segment.  Matching  of 
a  candidate  template  involves  taking  the  model  patch 
curve  points  and  verifying  that  a  curve  point  with  sim¬ 
ilar  orientation  lies  nearby  in  the  candidate  template. 
Essentially  this  amounts  to  directional  correlation. 
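
The matching step can be illustrated with a small sketch. The details here are assumptions chosen for illustration and are not taken from the system described: templates are stored as sparse dictionaries mapping (row, column) to an orientation code, there are 8 orientation codes that wrap around, and "nearby" means within one cell.

```python
def match_score(model, candidate, radius=1, max_code_diff=1, n_codes=8):
    """Fraction of model curve points that find a nearby candidate point
    with a similar orientation code (directional correlation)."""
    if not model:
        return 0.0
    hits = 0
    for (r, c), code in model.items():
        found = False
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                other = candidate.get((r + dr, c + dc))
                if other is None:
                    continue
                diff = abs(code - other) % n_codes
                diff = min(diff, n_codes - diff)   # orientation codes wrap
                if diff <= max_code_diff:
                    found = True
        if found:
            hits += 1
    return hits / len(model)

# A horizontal stroke across row 10 of a 21 x 21 template, code 0.
model = {(10, c): 0 for c in range(21)}
shifted = {(11, c): 0 for c in range(21)}   # same stroke, one row lower
crossed = {(10, c): 4 for c in range(21)}   # same place, opposite code
```

A stroke displaced by one cell still matches fully, whereas a stroke in the same position with a very different orientation code does not match at all.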

3.3  Recognition  Procedure 

In  order  to  recognize  objects,  we  must  first  prepare 
a  database  against  which  the  matching  takes  place.  To 
do  this,  we  first  take  a  number  of  images  of  each  ob¬ 
ject,  covering  the  region  on  the  viewing  sphere  over 
which  the  object  may  be  encountered.  The  exact 
number  of  images  per  object  may  vary  depending  on 
the  features  used  and  any  symmetries  present,  but  for 
the  patch  features  we  use,  obtaining  training  images 
about  every  20  degrees  is  sufficient.  To  cover  the  en¬ 
tire  sphere  at  this  sampling  requires  about  100  images. 
For  every  image  so  obtained,  the  boundary  extraction 
procedure  is  run,  and  the  best  25  or  so  boundaries 
are  selected  as  keys,  from  which  patches  are  gener¬ 
ated  and  stored  in  the  database.  With  each  patch  is 
associated  the  identity  of  the  object  that  produced  it, 
the  viewpoint  it  was  taken  from,  and  three  geometric 
parameters  specifying  the  2-D  size,  location,  and  ori¬ 
entation  of  the  image  of  the  object  relative  to  the  key 
curve.  This  information  permits  a  hypothesis  about 
the  identity,  viewpoint,  size,  location  and  orientation 
of  an  object  to  be  made  from  any  match  to  the  patch 
feature. 
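
The figure of about 100 images can be checked with a rough area argument, assuming each training view accounts for a patch of roughly 20 x 20 degrees on the viewing sphere. This is an approximation for illustration, not the authors' sampling scheme, and the function name is hypothetical.

```python
import math

def views_needed(spacing_deg, hemisphere=False):
    """Approximate number of views at the given angular spacing."""
    # A full sphere spans 4*pi steradians, about 41253 square degrees.
    sphere_sq_deg = 4.0 * math.pi * (180.0 / math.pi) ** 2
    area = sphere_sq_deg / 2.0 if hemisphere else sphere_sq_deg
    return area / spacing_deg ** 2

full = views_needed(20)                   # roughly 103
half = views_needed(20, hemisphere=True)  # roughly 52
```

At 25 or so keys per image, an object covered over the full sphere would then contribute on the order of 2,500 patches to the database.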


The  basic  recognition  procedure  consists  of  four 
steps.  First,  potential  key  features  are  extracted  from 
the  image  using  low  and  intermediate  level  visual  rou¬ 
tines.  In  the  second  step,  these  keys  are  used  to  access 
the  database  memory  and  retrieve  information  about 
what  objects  could  have  produced  them,  and  in  what 
relative  configuration.  The  third  step  uses  this  infor¬ 
mation  to  produce  hypotheses  about  the  identity  and 
configuration  of  potential  objects.  Finally,  these  hy¬ 
potheses  are  themselves  used  as  keys  into  a  second 
associative  memory,  where  evidence  for  them  is  accu¬ 
mulated.  After  all  features  have  been  so  processed,  the 
hypothesis  with  the  highest  evidence  score  is  selected. 
Secondary  hypotheses  can  also  be  reported. 
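
The four steps can be sketched with two hash tables. This is a deliberately simplified illustration, not the system itself: keys are assumed to be already reduced to hashable codes that match exactly, the configuration is reduced to a 2-D translation (scale, rotation, and viewpoint interpolation are ignored), evidence is a plain vote count, and all names are hypothetical.

```python
from collections import defaultdict

def build_database(training_patches):
    """Primary memory: key code -> list of (object, view, offset) records."""
    db = defaultdict(list)
    for code, obj, view, offset in training_patches:
        db[code].append((obj, view, offset))
    return db

def recognize(image_keys, db, bin_size=10):
    """Return hypotheses sorted by accumulated evidence, best first."""
    evidence = defaultdict(float)                  # second associative memory
    for code, (x, y) in image_keys:                # step 1: extracted keys
        for obj, view, (dx, dy) in db.get(code, ()):   # step 2: look up
            # Step 3: hypothesis = identity, view, quantized object location.
            pose = ((x + dx) // bin_size, (y + dy) // bin_size)
            evidence[(obj, view, pose)] += 1.0     # step 4: accumulate
    return sorted(evidence.items(), key=lambda kv: kv[1], reverse=True)

training = [
    ("a", "car", 0, (5, 5)),       # key "a" seen on the car, view 0
    ("b", "car", 0, (-5, 5)),      # key "b" seen on the car, view 0
    ("a", "plane", 3, (40, 0)),    # key "a" also occurs on the plane
]
db = build_database(training)
ranked = recognize([("a", (100, 100)), ("b", (110, 100))], db)
best_hypothesis, best_score = ranked[0]
```

Both image keys agree on the same car hypothesis, so it outscores the single vote for the plane; the remaining entries in the ranking correspond to secondary hypotheses.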

3.4  Evidence  Combination 

In  the  final  step  described  above,  an  important  is¬ 
sue  is  the  method  of  combining  evidence.  The  sim¬ 
plest  technique  is  to  use  an  elementary  voting  scheme 
-  each  piece  of  evidence  contributes  equally  to  the  to¬ 
tal.  This  is  clearly  not  well  founded,  as  a  feature  that 
occurs  in  many  different  situations  is  not  as  good  an 
indicator  of  the  presence  of  an  object  as  one  that  is 
unique  to  it.  For  example,  with  24  3-D  objects  stored 
in  the  database,  comprising  over  30,000  patches,  we 
find  that  some  image  features  match  1000  or  more 
database  features,  while  others  match  only  one  or  two. 
An  evidence  scheme  that  takes  this  into  account  would 
probably  display  improved  performance.  An  obvious 
approach  in  our  case  is  to  use  statistics  computed  over 
the  information  contained  in  the  associative  memory 
to  evaluate  the  quality  of  a  piece  of  information.  It  is 
clear  that  the  optimal  quality  measure,  which  would 
rely  on  the  full  joint  probability  distribution  over  keys, 
objects  and  configurations  is  infeasible  to  compute, 
and  thus  we  must  use  some  approximation. 

A simple approximation is to use the first-order
feature frequency distribution over the entire
database, and this is what we do. The actual
algorithm is to accumulate evidence, for each match
supporting a pose, proportional to F log(k/m), where
m is the number of matches to the image feature in the
whole database, and k is a proportionality constant
that attempts to make m/k represent the actual geometric
probability that some image feature matches a
particular patch in the pose model by accident. It can
be shown that maximizing the sum of these log(k/m)
terms (logs of reciprocal match probabilities)
is equivalent to Bayesian maximum likelihood
evidence combination using the match frequency as
an estimate of the prior probability of the feature
type, and assuming independence of observations. F
represents an additional empirical factor proportional
to the square root of the size of the feature in the image,
and the 4th root of the number of key features
in the model. These modifications capture certain aspects
that seem important to the recognition process,
but are difficult to model using formal probability
(essentially that bigger features are better, and that the
simplest explanation is preferred).
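As a rough illustration of the weighting just described, the following Python sketch scores a single match as F log(k/m) and sums the scores for one pose hypothesis. The function names, the default F = 1, and the value of k in the usage lines are assumptions made for illustration; the text does not give the constants actually used.

```python
import math

def match_evidence(m, k, F=1.0):
    """Evidence contributed by one feature match: F * log(k / m).

    m: number of database features matched by the image feature.
    k: proportionality constant (hypothetical value supplied by the caller).
    F: empirical factor; left as a caller-supplied number in this sketch.
    """
    if m <= 0 or k <= 0:
        raise ValueError("m and k must be positive")
    return F * math.log(k / m)

def pose_evidence(match_counts, k, F=1.0):
    """Sum the evidence of all matches that support a single pose."""
    return sum(match_evidence(m, k, F) for m in match_counts)

# Usage with a hypothetical k: a rare feature (2 matches in the database)
# counts for more than a common one (1000 matches).
rare = match_evidence(2, k=30000)
common = match_evidence(1000, k=30000)
```

Under this scheme a feature that matches only one or two database patches contributes substantially more evidence than one that matches a thousand, which is the behavior the simple voting scheme lacks.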

The  above  measure  allows  us  to  combine  evidence 
for  all  feature  matches  associated  with  a  given  pose  hy¬ 
pothesis  and  a  set  of  evidence.  We  now  want  to  find 
the  maximum  of  this  over  all  possible  poses.  Clearly, 
we can’t directly evaluate all pose hypotheses: there
are too many of them (e.g. 20 objects x 100 viewpoints
x 100 image locations x 20 orientations x 10
sizes = 40,000,000 poses to check). In our algorithm,
the  indexing  into  the  secondary  associative  memory 
functions  as  an  efficient  way  of  accumulating  the  evi¬ 
dence  for  all  poses  that  have  any  evidence  associated 
with  them  at  all  (most  possible  poses  have  none,  for  a 
given  set  of  evidence).  This  is  the  basic  Hough  trans¬ 
form  idea,  and  it  permits  the  pose  with  maximum  evi¬ 
dence  to  be  found  in  time  proportional  to  the  number 
of  pieces  of  evidence  times  a  database  lookup  factor 
rather  than  in  time  proportional  to  the  number  of  pos¬ 
sible  poses. 
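The accumulation step can be sketched with an ordinary dictionary standing in for the secondary associative memory. This is a minimal sketch under stated assumptions: poses are represented as hashable tuples of already-quantized parameters, each match votes log(k/m) with the factor F omitted, and the names `accumulate_evidence` and `best_pose`, the value of k, and the example matches are all hypothetical.

```python
import math
from collections import defaultdict

def accumulate_evidence(matches, k):
    """Accumulate evidence per pose hypothesis.

    matches: iterable of (pose, m) pairs, one per feature match, where
    pose is a hashable tuple and m is the database match count of the
    image feature that produced the hypothesis.
    """
    scores = defaultdict(float)
    for pose, m in matches:
        scores[pose] += math.log(k / m)
    return scores

def best_pose(scores):
    """Return the pose with the highest accumulated evidence, or None."""
    if not scores:
        return None
    return max(scores, key=scores.get)

# Size of the exhaustive search space quoted in the text.
N_POSES = 20 * 100 * 100 * 20 * 10

# Hypothetical matches: (object, viewpoint, location cell, orientation bin, size bin).
cup = ("cup", 17, 42, 3, 5)
car = ("car", 8, 42, 11, 5)
matches = [(cup, 2), (cup, 5), (car, 1000)]
scores = accumulate_evidence(matches, k=30000)
winner = best_pose(scores)
```

Only the poses that receive at least one vote are ever stored, so the work grows with the number of pieces of evidence rather than with the 40,000,000 entries of the full pose space.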

3.5  Implementation 

Using  the  principles  described  above,  we  imple¬ 
mented  a  recognition  system  for  rigid  3-D  objects.  The 
system  needs  a  particular  shape  or  pattern  to  index 
on,  and  does  not  work  well  for  objects  whose  charac¬ 
ter  is  statistical,  such  as  generic  trees  or  pine  cones. 
Component  boundaries  were  extracted  by  modifying  a 
stick-growing  method  for  finding  segments  developed 
recently  at  Rochester  [14]  so  that  it  could  follow  curved 
boundaries. 

The system is trained using images taken approximately
every 20 degrees around the sphere, amounting
to about 100 views for a full sphere, and 50 for a
hemisphere. This spacing depends on making the templates
sufficiently flexible to match between views. For objects entered into
the  database,  the  best  25  key  features  were  selected  to 
represent  the  object  in  each  view.  The  thresholds  on 
the  distance  metrics  between  features  were  adjusted  so 
that  they  would  tolerate  approximately  15-20  degrees 
deviation  in  the  appearance  of  a  frontal  plane  (less  for 
oblique  ones) . 

The  system  can  be  used  both  for  “recognition” 
(what  is  this)  and  “finding”  (where  is  object  X  in 
this  scene)  operations.  Preliminary  experiments  on 
the  finding  operation  indicate  good  performance  in 
large,  complex  scenes,  but  we  have  not  yet  acquired 
even  moderate  test  databases  for  this  problem,  so  the 
experiments  reported  below  all  involve  “recognition” 
tasks. 

3.6 Resource Requirements

The resource requirements scale more or less linearly with the size
of the database. Memory is about 3 Mbytes per hemisphere,
and overall times on a single processor UltraSPARC
are about 20 seconds for the 6 object database,
and about 2 minutes for the 24 object database. These
numbers could almost certainly be improved by pushing
on the indexing and data replication, which we
have not done as yet.

4  Experiments 

4.1  Variation  in  Performance  with  Size  of 
Database 

One  measure  of  the  performance  of  an  object  recog¬ 
nition  system  is  how  the  performance  changes  as  the 
number  of  classes  increases.  To  test  this,  we  obtained 
test  and  training  images  for  a  number  of  objects,  and 
built  3-D  recognition  databases  using  different  num¬ 
bers  of  objects.  The  objects  used  were  chosen  to  be 
“different” in that they were easy for people to distinguish
on the basis of shape. Data was acquired for
24 different objects (34 hemispheres). The objects are
shown  in  Figure  1.  The  number  of  hemispheres  is  not 
equal  to  twice  the  number  of  objects  because  a  num¬ 
ber  of  the  objects  were  either  unrealistic  or  painted 
flat  black  on  the  bottom  which  made  getting  training 
data  against  a  black  background  difficult. 

Clean  image  data  was  obtained  automatically  us¬ 
ing  a  combination  of  a  robot-mounted  camera,  and  a 
computer  controlled  turntable  covered  in  black  velvet. 
Training  data  consisted  of  53  images  per  hemisphere, 
spread  fairly  uniformly,  with  approximately  20  degrees 
between  neighboring  views.  The  test  data  consisted  of 
24  images  per  hemisphere,  positioned  in  between  the 
training  views,  and  taken  under  the  same  good  con¬ 
ditions.  Note  that  this  is  essentially  a  test  of  invari¬ 
ance  under  out-of-plane  rotations,  the  most  difficult  of 
the  6  orthographic  freedoms.  The  planar  invariances 
are  guaranteed  by  the  representation,  once  above  the 
level  of  feature  extraction,  and  experiments  testing 
this  have  shown  no  degradation  due  to  translation,  ro¬ 
tation,  and  scaling  up  to  50%.  Larger  changes  in  scale 
have  been  accommodated  using  a  multi-resolution  fea¬ 
ture  finder,  which  gives  us  4  or  5  octaves  at  the  cost 
of  doubling  the  size  of  the  database. 

We  ran  tests  with  databases  built  for  6,  12,  18  and 
24  objects,  shown  in  Figure  1,  and  obtained  overall 
success  rates  (correct  classification  on  forced  choice)  of 
99.6%, 98.7%, 97.4% and 97.0% respectively. (To find
out  which  objects  are  in  which  database,  just  count  the 
images  left  to  right,  top  to  bottom.)  The  results  are 
summarized  in  the  following  table.  The  worst  cases 
were  the  horse  and  the  wolf  in  the  24  object  test,  with 
19/24  and  20/24  correct  respectively.  On  inspection, 
some  of  these  pictures  were  difficult  for  human  sub¬ 
jects.  None  of  the  other  examples  had  more  than  2 
misses  out  of  the  24  (hemisphere)  or  48  (full  sphere) 
test  cases.  Overall,  the  performance  is  fairly  good.  In 
fact,  we  believe  this  represents  the  best  results  pre¬ 
sented  anywhere  for  this  sort  of  problem. 


num. of    num. of        num. of        num.       percent
objects    hemispheres    test images    correct    correct
   6           11             264          263        99.6
  12           18             408          403        98.7
  18           26             576          561        97.4
  24           34             768          745        97.0

Table 1: Performance of forced-choice recognition for
databases of different sizes


4.2 Performance in the Presence of Clutter

The feature-based nature of the algorithm provides
some immunity to the presence of clutter in the
scene,  in  contrast  to  appearance-based  schemes  that 
use  the  structure  of  the  full  object,  and  require  good 
global  segmentation.  For  modest  dark-field  clutter, 
the  method  is  quite  robust.  To  test  this,  we  acquired 
test  sets  of  the  six  objects  used  in  the  previous  6-object 
case in the presence of non-occluding clutter. Examples
of the test images are shown in Figure 2. Out of
264 test cases, 252 were classified correctly, which gives
a recognition rate of about 96%, compared to 99% for
uncluttered test images. A confusion matrix is shown
in Figure 3.

Figure 1: The objects used in testing the system

Figure 3: Error matrix for object classification experiment
with clutter. Columns contain counts of classification
results for test images of each type.

In  a  second  experiment,  we  took  pictures  of  the 
objects  against  a  light  background.  Clutter  in  these 
images  arises  from  shadows,  from  wrinkles  in  the  fab¬ 
ric,  and  from  a  substantial  shading  discontinuity  be¬ 
tween  the  turntable  and  the  background.  Unlike  the 
dark-field  pictures,  the  objects  in  many  of  these  pic¬ 
tures  are  not  trivially  segmentable.  Examples  of  the 
test  images  are  shown  in  Figure  4,  and  the  boundaries 
found in Figure 5. Note that some of the images produce
substantial numbers of clutter curves. (All the
images  shown  were  classified  correctly.) 

Out of 264 test cases, 236 were classified correctly,
which  gives  an  overall  recognition  rate  of  about  90%, 
which  is  not  as  good  as  some  of  our  other  results. 
However,  almost  half  the  errors  were  due  to  instances 
of  the  toy  bear,  the  reason  being  that  the  gray  level  of 
the  bear’s  body  was  so  close  to  the  upper  background 
in  low-level  shots  that  many  of  the  main  boundaries 
could  not  be  found  (people  had  trouble  with  these 
shots  too).  If  this  case  is  excluded,  the  rate  is  about 
94%. A confusion matrix is shown in Figure 6.


class     ref   num  |    0    1    2    3    4    5
cup        0     48  |   44    2    0    1    1    0
bear       1     48  |    3   32    1    5    2    5
car        2     24  |    0    0   24    0    0    0
rabbit     3     48  |    1    0    0   47    0    0
plane      4     48  |    0    0    0    0   45    3
fightr     5     48  |    0    0    1    0    3   44
Hypoths. for class   |   48   34   26   53   51   52

Figure 6: Error matrix for light field classification experiment.
Columns contain counts of classification results
for test images of each type.
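The rates quoted for the light-background experiment can be checked directly from the counts in Figure 6. The short Python sketch below does that arithmetic; the variable and function names are illustrative only, and each row holds the results for test images of one reference class.

```python
# Rows: reference class of the test image; columns: class reported by the system.
classes = ["cup", "bear", "car", "rabbit", "plane", "fightr"]
confusion = [
    [44, 2, 0, 1, 1, 0],
    [3, 32, 1, 5, 2, 5],
    [0, 0, 24, 0, 0, 0],
    [1, 0, 0, 47, 0, 0],
    [0, 0, 0, 0, 45, 3],
    [0, 0, 1, 0, 3, 44],
]

def accuracy(matrix, skip=()):
    """Return (correct, total, rate), ignoring the row indices in `skip`."""
    rows = [(i, row) for i, row in enumerate(matrix) if i not in skip]
    correct = sum(row[i] for i, row in rows)
    total = sum(sum(row) for _, row in rows)
    return correct, total, correct / total

overall = accuracy(confusion)                                    # all six classes
without_bear = accuracy(confusion, skip={classes.index("bear")})
hypotheses_per_class = [sum(col) for col in zip(*confusion)]     # column sums
```

The overall figure is 236 of 264 (about 90%), and leaving out the bear images gives 204 of 216 (about 94%), in agreement with the text. The column sums reproduce the "Hypoths. for class" row.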


4.3 Experiments on “Generic” Recognition

This set of experiments was suggested when we
tried showing our coffee mugs to an early version of the
system  that  had  been  trained  on  the  creamer  cup  in 
the  previous  database  (among  other  objects),  and  no¬ 
ticed  that  the  system  was  making  the  “correct”  generic 
call  a  significant  percentage  of  the  time.  Moreover,  the 
features  that  were  keying  the  classification  were  the 
“right”  ones,  i.e.,  boundaries  derived  from  the  handle, 
and  the  circular  sections,  even  though  there  was  no 
explicit  part  model  of  a  cup  in  the  system. 

The  notion  of  generic  visual  classes  is  ill  defined 
scientifically.  What  we  have  is  human  subjective  im¬ 
pressions  that  certain  objects  look  alike,  and  belong 
in  the  same  group  (e.g.  airplanes,  sports  cars,  spiders, 
teapots  etc.)  Unfortunately,  human  visual  classes  tend 
to  be  confounded  with  functional  classes,  and  biased 
by  experience  and  other  factors  to  an  extent  that 
makes  formalizing  such  classes,  even  phenomenologi¬ 
cally,  pretty  tough.  On  the  other  hand,  the  subjective 
intuition  is  so  strong,  and  the  early  evidence  of  correct 
“generalization”  so  intriguing,  that  the  matter  seemed 
worth  looking  into. 

For  the  test,  we  gathered  multiple  examples  of  ob¬ 
jects  from  several  classes,  which  an  (informal)  sample 
of  human  volunteers  agreed  looked  pretty  much  alike 
(our rough criterion was that you could tell at a glance
what class an object was in, but had to take a “second
look” to determine which member of the class it was).
We  ended  up  with  five  classes  consisting  of  11  cups,  6 
“normal”  airplanes,  6  fighter  jets,  9  sports  cars,  and  8 
snakes. 

The  recognition  system  was  trained  on  a  subset  of 
each  class,  and  tested  on  the  remaining  elements.  The 
training  sets  consisted  of  4  cups,  3  airplanes,  3  jet 
fighters,  4  sports  cars,  and  4  snakes.  These  classes 
are  shown  in  Figure  7,  with  the  training  objects  on 
the  left  of  each  picture,  and  the  test  objects  on  the 
right.  The  training  and  test  views  were  taken  accord¬ 
ing  to  the  same  protocol  as  in  the  previous  experi¬ 
ment.  The  cups,  planes,  and  fighter  jets  were  sam¬ 
pled  over  the  full  sphere;  the  cars  and  snakes  over  the 
top  hemisphere  (the  bottom  sides  were  not  realisti¬ 
cally  sculpted).  Overall  performance  on  forced  choice 
classification  for  792  test  images  was  737  correct,  or 
93.0%. If we instead average the per-class rates, so
that the best group, the cups, is not
weighted more heavily because it had more samples, we
get 92% (91.96%) performance. The error matrix is
shown in Figure 8.

The  performance  is  best  for  the  cups  at  about  98%, 
and  the  planes,  sports  cars  and  snakes  came  in  around 
92%-94%.  The  fighter  planes  were  the  worst  by  a  sig¬ 
nificant  factor,  at  about  83%.  The  reason  seems  to  be 
that  there  is  quite  a  bit  of  difference  between  the  ex¬ 
emplars  in  some  views  in  terms  of  armament  carried, 
which  tends  to  break  up  some  of  the  lines  in  a  way 
the  current  boundary  finder  does  not  handle.  Two  of 
the  test  cases  also  have  camouflage  patterns  painted 
on  them.  The  snakes  were  actually  a  bit  of  a  surprise, 
given  the  degree  of  flexibility,  and  the  fact  that  none  of 
the  curves  are  actually  the  same  (this  is  supposedly  a 
rigid object recognition system). The key seems to be
the generic “S” shape, which recurs in various ways in
all the exemplars, and is quite rare in general scenes.

Figure 2: Examples of test images with modest dark-field clutter

Figure 4: Examples of test images on light background, with shadows and minor texture

Figure 5: Curves found by boundary extraction algorithm in light background images

Figure 7: Test sets used in generic recognition experiment. The training objects are on the left side of each image
(4 cups, 3 planes, 3 fighters, 4 cars, 4 snakes) and the test objects are on the right.

Figure 8: Error matrix for generic classification experiment.
Columns contain counts of classification results
for test images of each type.

These  results  do  not  say  anything  conclusive  about 
the  nature  of  “generic”  recognition,  but  they  do  sug¬ 
gest  a  route  by  which  generic  capability  could  arise  in 
an  appearance  based  system  that  was  initially  targeted 
at  recognizing  specific  objects,  but  needed  enough  flex¬ 
ibility  to  be  able  to  deal  with  inter-pose  variability  and 
environmental  lighting  effects.  They  also  suggest  that 
one  way  of  viewing  generic  classes  is  that  they  cor¬ 
respond  to  clusters  in  a  (relatively)  spatially  uniform 
metric  space  defined  by  a  general,  context-free,  classi¬ 
fication  process.  Finer  distinctions  would  make  use  of 
this  context. 

5  Conclusions  and  Future  Work 

In  this  paper  we  have  described  a  framework 
for keyed appearance-based 3-D recognition, which
avoids  some  of  the  problems  of  previous  appearance- 
based  schemes.  We  ran  various  large-scale  perfor¬ 
mance  tests  and  found  good  performance  for  full- 
sphere/hemisphere  recognition  of  up  to  24  complex, 
curved  objects,  robustness  against  clutter,  and  some 
intriguing  generic  recognition  behavior. 

Future  plans  include  adding  enough  additional  ob¬ 
jects  to  push  the  performance  below  75%,  both  to  bet¬ 
ter  observe  the  functional  form  of  the  error  dependence 
on  scale,  and  to  provide  a  basis  for  substantial  im¬ 
provement.  We  also  want  to  see  how  the  performance 
can  be  improved  by  adding  a  final  verification  stage, 
since  we  have  observed  that  even  when  the  system  pro¬ 
vides  the  wrong  answer,  the  “right”  one  is  generally 
in  the  top  few  hypotheses.  In  another  direction,  we 
have  some  preliminary  results  indicating  that  the  sys¬ 
tem,  when  coupled  with  a  simple  memory-constraint 
protocol,  functions  very  well  for  finding  particular  ob¬ 
jects  in  large,  highly  cluttered  scenes.  We  plan  to 
gather  enough  data  for  this  problem  to  generate  statis¬ 
tically  significant  performance  data.  Finally,  we  want 
to  experiment  with  adapting  the  system  to  allow  fine 
discrimination  of  similar  objects  (same  generic  class) 
using directed processing driven by the generic classification.

Recognizing Objects by Matching Oriented Points

We present an approach to recognition of complex
objects  in  cluttered  3-D  scenes  that  does  not 
require  feature  extraction  or  segmentation.  Our 
object  representation  comprises  descriptive 
images  associated  with  each  oriented  point  on  the 
surface  of  an  object.  Using  a  single  point  basis  con¬ 
structed  from  an  oriented  point,  the  position  of 
other  points  on  the  surface  of  the  object  can  be 
described  by  two  parameters.  The  accumulation  of 
these  parameters  for  many  points  on  the  surface  of 
the  object  results  in  an  image  at  each  oriented 
point.  These  images,  localized  descriptions  of  the 
global  shape  of  the  object,  are  invariant  to  rigid 
transformations.  Through  correlation  of  images, 
point  correspondences  between  a  model  and  scene 
data  are  established.  Geometric  consistency  is  used 
to cluster the correspondences from which a plausible
rigid transformation that aligns the model with
the scene is calculated. The transformation is then
refined and verified using a modified iterative closest
point algorithm. The effectiveness of our algorithm
is demonstrated with results showing
recognition  of  complex  objects  in  cluttered  scenes 
with  occlusion. 

1.  Introduction 

For  recognition  of  complex  objects,  we  have  devel¬ 
oped  a  representation  that  combines  the  descriptive 
nature  of  global  object  properties  with  the  robust¬ 
ness  to  partial  views  and  clutter  of  local  shape 
descriptions.  Specifically,  a  local  basis  is  computed 
at  an  oriented  point  (3-D  point  with  surface  nor¬ 
mal)  on  the  surface  of  an  object  represented  as  a 
polygonal  surface  mesh.  The  positions  with  respect 
to  the  basis  of  other  points  on  the  surface  of  the 
object  can  then  be  described  by  two  parameters.  By 
accumulating  these  parameters  in  a  2-D  array,  a 
descriptive  image  associated  with  the  point  is  cre¬ 
ated.  Because  the  image  describes  the  coordinates 
of  points  on  the  surface  of  an  object  with  respect  to 
the local basis, it is a local encoding of the global
shape  of  the  object  and  is  invariant  to  rigid  trans¬ 
formations.  To  prepare  a  model  for  recognition,  an 
image  is  generated  for  each  point  on  the  model. 
Since  an  image  is  generated  at  each  point  in  the 
surface  mesh,  error  prone  feature  extraction  and 
segmentation  are  avoided.  At  recognition  time, 
images  from  points  on  the  model  are  compared 
with  images  from  points  in  the  scene;  when  two 
images  are  similar  enough,  a  point  correspondence 
between  model  and  scene  is  established.  Several 
point  correspondences  are  then  used  to  calculate  a 
transformation  from  model  to  scene  for  verifica¬ 
tion. 

Our  recognition  technique  developed  from  a  com¬ 
bination  of  basis  geometric  hashing  proposed  by 
Lamdan  and  Wolfson  [11]  and  structural  indexing 
proposed  by  Stein  and  Medioni  [13].  Because  we 
use  information  from  the  entire  surface  of  the 
object  in  our  representation,  instead  of  a  curve  or 
surface  patch  in  the  vicinity  of  the  point,  our  repre¬ 
sentation  is  more  discriminating  than  the  curves 
used  to  date  in  structural  indexing.  Furthermore, 
because  bases  are  computed  from  single  points,  our 
method  does  not  have  the  combinatoric  explosion 
present in basis geometric hashing as the number
of points is increased. In our algorithm, every point
on  the  model  that  is  visible  in  the  scene  can  be 
matched.  This  is  in  contrast  to  geometric  hashing 
where  only  select  feature  points  can  be  matched, 
making  its  effectiveness  dependent  on  feature 
extraction. 

The  idea  of  encoding  the  relative  position  of  many 
points  on  the  surface  of  an  object  in  an  image  or 
histogram is not new. Ikeuchi et al. [10] propose
invariant histograms for SAR target recognition.
This work is view-based and requires feature
extraction. Guéziec and Ayache [5] store parameters
for all points along a curve in a hash table for
efficient  matching  of  3-D  curves.  Their  method 
requires  the  extraction  of  extremal  curves  from  3-D 
images.
Chua and Jarvis [3] present an algorithm for
matching  3-D  free-form  surfaces  by  matching 
points  based  on  principal  curvatures.  Similarly, 
Thirion  [15]  presents  an  algorithm  for  matching  3- 
D  images  based  on  the  matching  of  extremal  points 
using  curvatures  and  Darboux  frames.  Pipitone  and 
Adams  [12]  propose  the  tripod  operator  which, 
when  placed  on  the  surface  of  an  object,  generates 
a few parameters describing surface shape. Bergevin
et al. [2] propose a registration algorithm
based  on  matching  properties  of  triangles  gener¬ 
ated  from  a  hierarchical  tessellation  of  an  object’s 
surface.  Our  approach  differs  from  these  because 
the  images  computed  at  each  point  are  much  more 
discriminating  than  principal  curvatures  and  angles 
between frames measured at a point. The discriminability
of spin-images greatly reduces the number
of possible correspondences between points.

During  recognition,  a  large  number  of  points  over 
the  entire  scene  are  matched  to  model  points.  Since 
many  correspondences  are  established,  the  validity 
of  individual  correspondences  can  be  determined 
by  comparing  them  to  the  properties  of  the  group 
of  correspondences.  Using  this  reasoning  our  algo¬ 
rithm  is  able  to  select  the  best  correspondences 
even  in  the  presence  of  complex  scenes  with  clutter 
and  occlusions. 

The  paper  is  organized  as  follows:  First,  we 
describe  how  the  images  used  in  matching  oriented 
points  are  generated  and  compared.  Next,  we 
explain  the  robustness  of  the  images  to  clutter  and 
occlusion  and  describe  how  correspondences  are 
grouped  to  compute  plausible  transformations 
from  model  to  scene.  We  then  explain  our  verifica¬ 
tion  algorithm,  which  is  based  on  the  iterative  clos¬ 
est  point  algorithm.  Finally,  we  show  recognition 
results  in  cluttered  scenes  and  conclude  with  a  dis¬ 
cussion  of  future  work. 

2.  Spin-Images 

The  fundamental  shape  element  we  use  for  match¬ 
ing  is  an  oriented  point,  a  three-dimensional  point 
with  an  associated  direction.  We  define  an  oriented 
point  O  on  a  surface  mesh  of  an  object  using  vertex 
position  p  and  surface  normal  n  (defined  as  the 
normal  of  the  best  fit  plane  to  the  point  and  its 
neighbors  in  the  mesh  oriented  to  the  outside  of  the 
object). As shown in Figure 1, an oriented point
defines a 2-D basis (p,n) (i.e., local coordinate system)
using the tangent plane T through p oriented
perpendicularly to n and the line L through p parallel
to n. The two coordinates of the basis are α, the
perpendicular distance to the line L, and β, the
signed perpendicular distance to the plane T. A
spin-map S_O is the function that maps 3-D points x
to the 2-D coordinates of a particular basis (p,n)
corresponding to oriented point O:

$$S_O(x) = \left( \sqrt{\|x - p\|^2 - (n \cdot (x - p))^2},\; n \cdot (x - p) \right) \qquad (1)$$

Figure 1: The geometry of an oriented point basis.

The  term  spin-map  comes  from  the  cylindrical  symme¬ 
try  of  the  oriented  point  basis;  the  basis  can  spin  about 
its  axis  with  no  effect  on  the  coordinates  of  points  with 
respect  to  the  basis. 
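Equation (1) translates almost directly into code. The Python sketch below is an illustration rather than the authors' implementation: the function name `spin_map` is hypothetical, points are plain 3-tuples, and the normal is re-normalized defensively. The usage lines check the cylindrical symmetry numerically by rotating a point about the axis through p.

```python
import math

def spin_map(x, p, n):
    """Map a 3-D point x to (alpha, beta) in the basis of oriented point (p, n)."""
    norm = math.sqrt(sum(c * c for c in n))
    n = [c / norm for c in n]                      # unit surface normal
    d = [xi - pi for xi, pi in zip(x, p)]          # x - p
    beta = sum(di * ni for di, ni in zip(d, n))    # signed distance to the tangent plane
    alpha_sq = sum(di * di for di in d) - beta * beta
    alpha = math.sqrt(max(alpha_sq, 0.0))          # distance to the line through p along n
    return alpha, beta

# Oriented point at the origin with normal along z.
p, n = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
a1, b1 = spin_map((3.0, 4.0, 2.0), p, n)

# Rotating the point about the normal axis leaves (alpha, beta) unchanged.
t = 0.7
rotated = (3.0 * math.cos(t) - 4.0 * math.sin(t),
           3.0 * math.sin(t) + 4.0 * math.cos(t),
           2.0)
a2, b2 = spin_map(rotated, p, n)
```

The clamp on `alpha_sq` only guards against small negative values from floating-point rounding when x lies on the line L.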

Each oriented point O on the surface of an object
has a unique spin-map S_O associated with it. When
S_O is applied to all of the other points on the surface
of the object, a set of 2-D points is created.
We will use the term spin-image I_O to refer to
the result of applying the spin-map S_O to the set of
points on the object. A spin-image is a description of the
shape  of  an  object  because  it  is  the  projection  of 
the  relative  position  of  3-D  points  that  lie  on  the 
surface  of  an  object  to  a  2-D  space  where  some  of 
the  3-D  metric  information  is  preserved.  Since 
spin-images  describe  the  shape  of  an  object  inde¬ 
pendently  of  its  pose,  they  are  object  centered 
shape  descriptions. 

Correspondences  are  established  between  oriented 
points  by  comparing  spin-images.  If  spin-images 
are  represented  as  a  set  of  2-D  points  then  compar¬ 
isons  will  have  to  be  made  between  points  sets:  a 
costly  and  ill-defined  operation.  Instead,  as 
explained  below,  spin-images  are  represented  as 
images  that  are  compared  through  correlation. 

To create the spin-image for the oriented point O on
the surface of an object, the following procedure
is invoked. For each point x on the surface of the
object, the spin-map coordinates (α,β) with respect
to O are computed. Next, the pixel P that the coordinates
index in the image is determined by discretizing
(α,β). Finally, the array is updated by
incrementing the pixels surrounding P in the
image. In order to blur the position of the point in
the histogram to account for noise in the data and
the discrete sampling of the surfaces in the scene,
the contribution of the point is bilinearly interpolated
to the four pixels surrounding (α,β). In general,
the pixel size is set to two times the resolution
of  the  surface  mesh  (measured  as  the  average  of  the 
edge  lengths  in  the  mesh).  Figure  2  shows  some 
spin  images  for  a  CAD  object.  The  darker  the 
pixel,  the  more  points  have  fallen  into  that  particu¬ 
lar  bin. 
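The generation procedure can be sketched as follows. This is a minimal Python illustration under assumptions not stated in the text: α runs along the columns starting at zero, β runs along the rows with β = 0 at the vertical center and positive β toward the top, contributions that fall outside the array are dropped, and the names `spin_map` and `spin_image` are hypothetical.

```python
import math

def spin_map(x, p, n):
    """Return (alpha, beta) of point x in the basis of oriented point (p, n)."""
    norm = math.sqrt(sum(c * c for c in n))
    n = [c / norm for c in n]
    d = [xi - pi for xi, pi in zip(x, p)]
    beta = sum(di * ni for di, ni in zip(d, n))
    alpha = math.sqrt(max(sum(di * di for di in d) - beta * beta, 0.0))
    return alpha, beta

def spin_image(points, p, n, bin_size, n_rows, n_cols):
    """Accumulate spin-map coordinates into a 2-D array with bilinear weights."""
    image = [[0.0] * n_cols for _ in range(n_rows)]
    beta_max = 0.5 * n_rows * bin_size              # assumed beta range is symmetric
    for x in points:
        alpha, beta = spin_map(x, p, n)
        col = alpha / bin_size                      # continuous pixel coordinates
        row = (beta_max - beta) / bin_size
        j, i = math.floor(col), math.floor(row)
        a, b = col - j, row - i                     # fractional offsets in [0, 1)
        for di, dj, w in ((0, 0, (1 - a) * (1 - b)), (0, 1, a * (1 - b)),
                          (1, 0, (1 - a) * b), (1, 1, a * b)):
            r, c = i + di, j + dj
            if 0 <= r < n_rows and 0 <= c < n_cols:
                image[r][c] += w                    # spread the point over four pixels
    return image

# One point with alpha = 1.5 and beta = 0.5 lands evenly on four pixels.
img = spin_image([(1.5, 0.0, 0.5)], (0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
                 bin_size=1.0, n_rows=4, n_cols=4)
```

Following the text, `bin_size` would typically be set to about twice the mesh resolution; the value 1.0 above is chosen only to keep the example easy to follow.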

The  idea  of  spin-images  evolved  from  concepts 
used  in  geometric  hashing.  Our  initial  idea  was  to 
use  all  of  the  points  on  the  surface  of  an  object  in  a 
basis  geometric  hashing  algorithm.  Unfortunately, 
constructing coordinate systems from all tuples of
points  would  lead  to  a  combinatoric  explosion  in 
the  indexing  [11]  (given  the  large  number  of 
points  on  the  surface).  Furthermore,  coordinate 
systems  constructed  from  tuples  of  points  are  very 
sensitive  to  the  position  of  points  sensed  on  the  sur¬ 
face.  Instead,  we  decided  to  encode  the  position  of 
points  with  respect  to  a  2-D  basis  defined  with  one 
oriented  point,  in  order  to  reduce  the  combinatoric 
explosion  and  position  sensitivity.  After  matching 
points  by  using  a  hash  table,  we  determined  that  it 
would  be  just  as  effective,  and  much  more  efficient, 
to  simply  store  an  image  that  described  the  location 
of  other  points  with  respect  to  the  oriented  point, 
instead  of  performing  lookup  in  a  hash  table.  From 
this, the concept of a spin-image was born. Using
spin-images  to  match  points  opens  up  the  entire 
field  of  image  based  matching,  giving  us  powerful 
comparison  tools  such  as  image  correlation. 


Figure  2:  Some  example  spin  images  generated  for 
three  different  oriented  points  on  a  CAD  model  of 
a  valve. 


Figure 3: Spin-images generated from two
different samplings of a model of a femur.
Although the samplings are different, the spin-images
generated from corresponding points are
similar.

Because  a  spin-image  is  a  global  encoding  of  the 
surface,  it  would  seem  that  any  disturbance  such  as 
clutter  and  occlusion  would  prevent  matching.  In 
fact,  this  representation  is  resistant  to  clutter  and 
occlusion,  assuming  that  some  precautions  are 
taken. This will be described in detail in Section 4.

3.  Comparing  Spin-Images 

Spin  images  generated  from  the  scene  and  the 
model  will  be  similar  because  they  are  based  on  the 
shape  of  objects  imaged.  However,  they  will  not  be 
exactly  the  same  due  to  variations  in  surface  sam¬ 
pling  and  noise  from  different  views.  For  example, 
in  Figure  3  the  vertex  positions  and  connectivity  of 
two  models  of  a  femur  are  different,  yet  the  spin- 
images  from  corresponding  points  are  similar.  A 
standard  way  of  comparing  linearly  related  images 
is  the  correlation  coefficient.  Because  the  correla¬ 
tion  coefficient  can  be  used  to  rank  point  corre¬ 
spondences,  correct  and  incorrect  correspondences 
can  be  differentiated. 

The  linear  correlation  coefficient  provides  a  simple 
way  to  compare  two  spin-images  that  can  be 
expected  to  be  similar  across  the  entire  image.  In 
practice,  spin  images  generated  from  range  images 
will have clutter (extra data) and occlusions (missing
data). A first step in limiting the effect of clutter
and occlusion is to compare spin images only
in  the  pixels  where  both  of  the  images  have  data.  In 
other  words,  the  data  used  to  compute  the  linear 
correlation  coefficient  is  taken  only  from  the  region 
of  overlap  between  two  spin  images.  In  this  case, 
knowledge  of  the  spin-image  generation  process  is 
used  to  eliminate  outliers  in  the  correlation  compu¬ 
tation. 
Since the linear correlation coefficient is a function
of  the  number  of  bins  used  to  compute  it,  the 
amount  of  overlap  will  have  an  effect  on  the  corre¬ 
lation  coefficients  obtained.  The  more  bins  used  to 
compute  the  correlation  coefficient,  the  more  confi¬ 
dence  there  is  in  its  value.  The  variance  of  the  cor¬ 
relation  coefficient  is  included  in  the  calculations 
of  the  relative  similarity  between  two  images  so 
that  the  similarity  measure  between  pairs  of  images 
with  differing  amounts  of  overlap  can  be  com¬ 
pared.  An  appropriate  similarity  function  C  which 
we  use  instead  of  the  correlation  coefficient  to 
compare  spin-images  P  and  Q  where  N  is  the  num¬ 
ber  of  overlapping  bins  is 

\[ C(P,Q) = \left( \operatorname{atanh}(R(P,Q)) \right)^2 - \lambda \left( \frac{1}{N-3} \right) \tag{2} \]

where $R(P,Q)$ is the linear correlation coefficient
computed over the overlapping bins. The similarity
function returns a high value for two images that
are highly correlated and have a large number of
overlapping bins. The change of variables, a standard
statistical technique ([4] Chapter 12) performed by
the hyperbolic arctangent function, transforms the
correlation coefficient into a distribution whose
variance is independent of the mean. In (2), $\lambda$
is a free variable used to weight the variance against
the expected value of the correlation coefficient.
In practice $\lambda$ is set to three.
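
As an illustration only, the similarity measure (2) can be sketched in Python: the squared hyperbolic arctangent of the correlation coefficient over the overlapping bins, minus the weight (three by default) divided by N - 3. The function name `similarity`, the flat-list image layout, and the use of `None` to mark bins without data are assumptions of this sketch and are not taken from the original system. The clamp on the correlation value is added only to keep atanh finite.

```python
import math


def similarity(p, q, lam=3.0):
    """Similarity of two spin-images given as equal-length lists of bins.

    Bins without data are marked with None (an assumption of this sketch).
    Only bins where both images have data contribute.
    Returns None when the overlap is too small or an image is constant.
    """
    pairs = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    n = len(pairs)
    if n <= 3:
        return None  # the variance term 1/(N-3) is undefined
    mean_p = sum(a for a, _ in pairs) / n
    mean_q = sum(b for _, b in pairs) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in pairs)
    var_p = sum((a - mean_p) ** 2 for a, _ in pairs)
    var_q = sum((b - mean_q) ** 2 for _, b in pairs)
    if var_p == 0.0 or var_q == 0.0:
        return None  # correlation is undefined for a constant image
    r = cov / math.sqrt(var_p * var_q)
    r = max(-0.999999, min(0.999999, r))  # keep atanh finite
    return math.atanh(r) ** 2 - lam / (n - 3)
```

With this form, two image pairs with the same correlation are ranked by their amount of overlap, which is the behavior the text asks of the measure.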

4.  Limiting  the  Effect  of  Clutter  and 
Occlusions 

In real scenes, clutter and occlusion are omnipresent.
Any object recognition system designed for
the  real  world  must  somehow  deal  with  clutter  and 
occlusion.  Some  systems  perform  segmentation 
before  recognition  in  order  to  separate  clutter  from 
interesting  object  data.  In  our  case,  the  effects  of 
clutter  are  manifested  as  a  corruption  of  the  pixel 
values  of  spin-images  generated  from  the  scene 
data.  To  some  extent,  the  effect  of  clutter  and 
occlusion  can  be  limited  by  setting  two  thresholds 
that  determine  which  points  contribute  to  spin- 
image  generation.  The  first  threshold  sets  the  maxi¬ 
mum  distance  between  the  oriented  point  basis  and 
a  point  in  the  mesh  contributing  to  the  spin-image. 
This  parameter  localizes  the  support  of  the  spin- 
image  to  a  sphere  around  its  oriented  point.  In  gen¬ 
eral  this  distance  threshold  is  set  to  the  size  of  the 
model  (average  distance  of  points  on  the  model 
from its centroid). The second threshold sets the
maximum angle between the oriented point basis
surface  normal  and  the  surface  normal  of  other 
points  on  the  surface.  This  threshold  prevents  most 
points  that  will  be  self-occluded  from  contributing 
to  the  spin-image  without  specifying  a  viewing 
direction.  This  angular  threshold  is  usually  set  to 
90  degrees. 
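
The two thresholds can be read as a simple membership test applied to every mesh point before it is accumulated into a spin-image. The sketch below is a hedged illustration: the function name `in_support`, the tuple representation of points, and the requirement that normals are unit length are assumptions made here.

```python
import math


def in_support(p, n, x, nx, max_dist, max_angle_deg=90.0):
    """Decide whether surface point x (unit normal nx) contributes to the
    spin-image of the oriented point (p, n), where n is a unit normal.

    max_dist      -- distance threshold (support sphere radius)
    max_angle_deg -- largest allowed angle between the two normals
    """
    if math.dist(p, x) > max_dist:
        return False  # outside the sphere around the oriented point
    cos_angle = sum(a * b for a, b in zip(n, nx))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg
```

A point facing away from the basis normal is rejected by the angle test even when it lies well inside the support sphere, which is how likely self-occluded points are excluded without a viewing direction.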

In  order  to  analyze  the  effects  of  clutter,  we  have 
developed  a  simple  model  of  the  effect  of  clutter  on 
spin-images  under  the  assumption  that  objects  are 
spherical.  The  clutter  model  combines  the  angular 
and  distance  thresholds  explained  above  with  the 
fact  that  objects  of  non-zero  thickness  cannot  inter¬ 
sect  to  show  that  clutter  is  limited  to  connected 
regions  in  spin-images.  Because  of  limited  space, 
we  cannot  include  a  derivation  of  the  clutter  model, 
but  our  approach  to  clutter  analysis  is  sketched  in 
Figure  4.  We  are  currently  extending  the  clutter 
model  to  arbitrary  objects  by  setting  the  radii  of  the 
spheres  in  the  clutter  model  appropriately.  Similar 
reasoning  shows  that  the  effect  of  occlusion  is  also 
limited  to  connected  regions  in  spin-images. 

Clutter  and  occlusion  manifest  as  extra  and  missing 
points  in  the  scene  where  the  number  of  these 
points  is  bounded.  Therefore,  it  is  reasonable  to 
assume that the total change $\delta_i$ of any corrupted
pixel in a scene spin-image is bounded, $|\delta_i| < \delta$.
Let the number of corrupted pixels in the scene
spin-image be $N_C$ and the total number of pixels be
N.  If  the  model  and  scene  pixel  values  are  normal¬ 
ized  on  [0,1],  then  the  lower  bound  on  the  correla- 


Figure  4:  Theoretical  clutter  model.  Because 
objects  cannot  intersect  (a),  the  corruption  due  to 
clutter  is  limited  to  connected  regions  in  spin- 
images  (b).  This  results  in  a  lower  bound  on  the 
effect of clutter on the correlation coefficient of
two  spin-images  (c). 

Figure  5:  Experimental  verification  of  clutter 
model.  96  of  392  pixels  in  the  scene  spin-image  are 
corrupted by an amount $\delta$ less than 0.40. The
correlation  coefficient  for  the  two  images  (0.841)  is 
well  above  the  lower  bound  (0.700)  predicted  by 
the  clutter  model.  In  the  scene,  the  white  lines 
indicate  clutter  and  occlusion  of  the  model  and  are 
not  part  of  the  original  data. 


tion  coefficient  when  comparing  model  and  scene 
spin-images  is 


Plb  =  +  (3) 

where  is  the  variance  of  the  pixels  in  the  model 
spin-image.  Hence,  the  worst  case  effect  of  clutter 
and  occlusion  grows  sub-linearly  with  the  area  of 
corruption  in  the  scene  spin-image.  Since  clutter 
and  occlusion  cannot  corrupt  an  entire  spin-image 
and  the  effect  of  the  corruption  on  the  correlation 
coefficient  is  bounded,  it  can  be  concluded  that 
matching  of  spin-images  is  only  moderately 
affected by clutter and occlusion. Figure 5 validates
our clutter model using spin-images from a
real scene with clutter and occlusion.


5.  Generating  Point  Correspondences 

The  similarity  measure  (2)  provides  a  way  to  rank 
correspondences  so  that  only  reasonable  corre¬ 
spondences  are  established.  Before  recognition 
(off-line),  spin-images  are  generated  for  all  points 
on  the  model  surface  mesh  and  stored  in  a  spin- 
image  stack.  At  recognition  time,  a  scene  point  is 
selected  randomly  from  the  scene  surface  mesh  and 
its  spin-image  is  generated.  The  scene  spin-image 
is  then  correlated  with  all  of  the  images  in  the 
model spin-image stack, and the similarity measures
(2) for each image pair are calculated and
inserted  in  a  histogram.  As  explained  below,  the 
images  in  the  model  spin-image  stack  with  high 
similarity  measure  when  compared  to  the  scene 
spin-image  produce  model/scene  point  correspon¬ 
dences  between  their  associated  oriented  points. 
This  procedure  to  establish  point  correspondences 
is  repeated  for  a  random  sampling  of  scene  points 
that  adequately  cover  the  scene  surface.  Depending 
on  the  complexity  and  amount  of  clutter  on  the 
scene,  this  number  can  vary  between  one  tenth  and 
one  half  of  the  points  in  the  scene.  The  end  result  is 
a  list  of  model/scene  point  correspondences  that 
are  then  filtered  and  grouped  in  order  to  compute 
the transformation from model to scene.

Possible  corresponding  model  points  are  chosen  by 
finding  the  upper  outliers  in  the  histogram  of  simi¬ 
larity  measures  for  each  scene  point.  This  method 
of  choosing  correspondences  is  reliable  for  two 
reasons.  First,  if  no  outliers  exist,  then  the  scene 
point  has  a  spin-image  that  is  very  similar  to  all  of 
the  model  spin-images,  so  definite  correspon¬ 
dences  with  this  scene  point  should  not  be  estab¬ 
lished.  Second,  if  multiple  outliers  exist,  then 
multiple  model  points  are  similar  to  a  single  scene 
point, so all of them should be considered in the matching
process. We use a standard method for detection of
outliers in a histogram ([4] Chapter 1); correspondences
that have similarity measures that are
greater  than  the  upper  fourth  plus  three  times  the 
fourth  spread  of  the  histogram  are  statistical  outli¬ 
ers.  Figure  6  shows  a  similarity  measure  histogram 
with  detected  outliers. 
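
The outlier rule above (upper fourth plus three times the fourth spread) can be sketched as follows. This is a minimal illustration: the function name `upper_outliers` is hypothetical, and the fourths are taken as the medians of the lower and upper halves of the sorted data, with the median included in both halves when the count is odd, which is one common textbook convention assumed here.

```python
from statistics import median


def upper_outliers(values, k=3.0):
    """Return indices of values above upper_fourth + k * fourth_spread."""
    s = sorted(values)
    n = len(s)
    half = (n + 1) // 2  # the median belongs to both halves when n is odd
    lower_fourth = median(s[:half])
    upper_fourth = median(s[n - half:])
    fourth_spread = upper_fourth - lower_fourth
    threshold = upper_fourth + k * fourth_spread
    return [i for i, v in enumerate(values) if v > threshold]
```

An empty result corresponds to the first case in the text: the scene point resembles every model point about equally, so no correspondence is proposed for it.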

During  matching,  a  single  point  can  be  matched  to 
more  than  one  point  for  two  reasons.  First,  symme¬ 
try  in  the  data  and  in  spin  image  generation  may 
cause  two  points  to  have  similar  spin-images.  Sec¬ 
ond,  spatially  close  points  may  have  similar  spin- 
images.  Furthermore,  if  an  object  appears  multiple 


Figure  6:  Similarity  measure  histogram. 
times in the scene, then a single model point will
match  multiple  scene  points. 

During  matching,  some  points  selected  from  scene 
clutter  may  be  incorrectly  matched  to  model 
points. However, given the numerous correspondences,
it is possible to reason about which correspondences
are actually on the model based on
properties  of  the  correspondences  taken  as  a  group. 
This  integral  approach  is  robust  because  it  does  not 
require  reasoning  about  specific  point  matches  to 
decide  which  correspondences  are  the  best.  This 
approach  is  in  contrast  to  hypothesize  and  test  and 
alignment  paradigms  of  recognition  where  the 
minimal  number  of  correspondences  required  to 
match  model  to  scene  are  proposed  and  then  veri¬ 
fied  through  some  other  means. 

First,  similarity  measure  is  used  to  remove  unlikely 
correspondences.  All  correspondences  with  simi¬ 
larity  measures  that  are  less  than  some  fraction  of 
the  maximum  similarity  measure  of  all  of  the  cor¬ 
respondences  are  eliminated.  In  practice,  this  frac¬ 
tion  is  set  to  one  half. 

The  second  method  for  filtering  out  unlikely  corre¬ 
spondences  uses  geometric  consistency  which  is  a 
measure  of  the  likelihood  that  two  correspondences 
can  be  grouped  together  to  calculate  a  transforma¬ 
tion  of  model  to  scene.  If  a  correspondence  is  not 
geometrically  consistent  with  other  correspon¬ 
dences,  then  it  cannot  be  grouped  with  other  corre¬ 
spondences  to  calculate  a  transformation,  and  it 
should  be  eliminated. 


Figure  7:  Three  scene  points  and  their  best 
matching  model  points  shown  with  associated  best 
matching  spin-images  for  a  scene  containing  a 
model  of  the  head  of  Venus. 


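The geometric consistency of correspondences $C_1 = [s_1, m_1]$
and $C_2 = [s_2, m_2]$ is measured by comparing
the spin-map coordinates (1) of corresponding
points.

\[ d_{gc}(C_1, C_2) = 2\,\frac{\| S_{m_2}(m_1) - S_{s_2}(s_1) \|}{\| S_{m_2}(m_1) \| + \| S_{s_2}(s_1) \|} \]

\[ D_{gc}(C_1, C_2) = \max\left( d_{gc}(C_1, C_2),\; d_{gc}(C_2, C_1) \right) \tag{4} \]

Here $S_{m_2}(m_1)$ denotes the spin-map coordinates of $m_1$
with respect to the oriented point $m_2$. The normalized
distance $d_{gc}$ between spin-map coordinates is used
because it is a compact way to measure the consistency
in position and normals. Since $d_{gc}$ is not symmetric,
the maximum of the distances is used to define
the geometric consistency distance $D_{gc}$.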

To  filter  out  correspondences  based  on  geometric 
consistency,  correspondences  that  are  not  geomet¬ 
rically  consistent  with  at  least  one  quarter  of  the 
correspondences  in  the  list  are  eliminated.  The  end 
result  after  filtering  on  similarity  measure  and  geo¬ 
metric  consistency  is  a  list  of  correspondences  that 
are  the  most  likely  to  be  correct.  In  practice  this 
number  is  between  20  and  50  correspondences. 
Figure  7  shows  three  of  the  best  correspondences 
and  matching  spin-images  between  a  model  of  the 
head  of  the  goddess  Venus  and  a  scene  containing 
it.  The  next  step  is  to  group  these  correspondences 
into  sets  that  can  be  used  to  compute  transforma¬ 
tions. 
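
The geometric consistency filter can be sketched as below, as an illustration under stated assumptions rather than the original implementation. A correspondence is represented as a pair (scene, model) of oriented points, each a (position, unit normal) tuple. The spin-map used is the usual cylindrical coordinate pair (radial distance from the normal line, signed height along the normal). The normalized distance is written as twice the difference of spin-map coordinates divided by the sum of their norms. The names `spin_map`, `d_gc`, `D_gc`, and `filter_consistent`, and the consistency threshold `max_dist`, are hypothetical.

```python
import math


def spin_map(p, n, x):
    """Spin-map coordinates (alpha, beta) of x relative to oriented point (p, n)."""
    d = [xi - pi for xi, pi in zip(x, p)]
    beta = sum(di * ni for di, ni in zip(d, n))            # height along the normal
    alpha = math.sqrt(max(0.0, sum(di * di for di in d) - beta * beta))
    return alpha, beta


def d_gc(c1, c2):
    """Normalized (non-symmetric) distance between spin-map coordinates."""
    (s1, m1), (s2, m2) = c1, c2
    a = spin_map(m2[0], m2[1], m1[0])   # model point 1 seen from model point 2
    b = spin_map(s2[0], s2[1], s1[0])   # scene point 1 seen from scene point 2
    denom = math.hypot(*a) + math.hypot(*b)
    if denom == 0.0:
        return 0.0
    return 2.0 * math.hypot(a[0] - b[0], a[1] - b[1]) / denom


def D_gc(c1, c2):
    """Symmetric geometric consistency distance."""
    return max(d_gc(c1, c2), d_gc(c2, c1))


def filter_consistent(corrs, max_dist=0.25, min_fraction=0.25):
    """Keep correspondences consistent with at least min_fraction of the others."""
    kept = []
    for i, c in enumerate(corrs):
        others = [o for j, o in enumerate(corrs) if j != i]
        n_ok = sum(1 for o in others if D_gc(c, o) < max_dist)
        if others and n_ok >= min_fraction * len(others):
            kept.append(c)
    return kept
```

In a toy scene that is a pure translation of the model, all correct correspondences have zero distance to one another, while a correspondence to a far-away clutter point is inconsistent with all of them and is dropped.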


6.  Grouping  Correspondences 

Single  correspondences  cannot  be  used  to  compute 
a  transformation  from  model  to  scene  because  an 
oriented  point  basis  encodes  only  five  of  the  six 
necessary  degrees  of  freedom.  At  least  two  ori¬ 
ented  point  correspondences  are  needed  to  calcu¬ 
late  a  transformation  if  position  and  normals  are 
used.  To  avoid  combinatoric  explosion,  geometric 
consistency  is  used  to  cluster  the  correspondences 
into  a  few  groups  from  which  plausible  transforma¬ 
tions  are  computed.  Since  many  correspondences 
are  grouped  together  and  used  to  compute  a  trans¬ 
formation,  the  resulting  transformation  is  more 
robust  than  one  computed  from  a  few  correspon¬ 
dences. 

We cluster correspondences based on a measure of
geometric consistency $w_{gc}$ that is the geometric
consistency distance between two correspondences
(4) augmented by a weight that promotes clustering
of correspondences that are far apart.

\[ w_{gc}(C_1, C_2) = \frac{d_{gc}(C_1, C_2)}{1 - e^{-\left( \| S_{m_2}(m_1) \| + \| S_{s_2}(s_1) \| \right)/2}} \]

\[ W_{gc}(C_1, C_2) = \max\left( w_{gc}(C_1, C_2),\; w_{gc}(C_2, C_1) \right) \tag{5} \]

$W_{gc}$ will be small when two correspondences are
geometrically consistent and far apart. The measure
of geometric consistency between a correspondence
$C$ and a cluster of correspondences
$\{C_1, \ldots, C_n\}$ is

\[ W_{gc}(C, \{C_1, \ldots, C_n\}) = \max_i \left( W_{gc}(C, C_i) \right) \tag{6} \]

Given a list of correspondences $L = \{C_1, \ldots, C_n\}$,
the clustering procedure for each correspondence is
as follows: Select a seed correspondence $C_i$ in $L$
and initialize a cluster $G_i = \{C_i\}$. Find the
correspondence $C_j$ in $L$ for which $W_{gc}(C_j, G_i)$ is a
minimum. Add $C_j$ to $G_i$ if $W_{gc}(C_j, G_i) < T_{gc}$, where the
threshold $T_{gc}$ is set to the size of the model. Repeat
until no more correspondences can be added to $G_i$.
The  clustering  procedure  is  performed  for  each 
correspondence  in  L,  and  the  end  result  is  n  clus¬ 
ters,  one  for  each  correspondence  in  L.  This  clus¬ 
tering  algorithm  allows  a  correspondence  to  appear 
in  multiple  clusters  which  is  necessary  to  handle 
model  symmetry.  For  example,  the  CAD  model  in 
Figure  2  has  a  plane  of  symmetry  resulting  in  two 
feasible  transformations.  Correspondences  along 
the  plane  of  symmetry  contribute  to  two  distinct 
transformations. 
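
The seeded clustering procedure can be sketched as follows. This is a hedged illustration: the pairwise consistency measure is passed in as a callable `w` (standing in for the weighted geometric consistency measure), and the function name `cluster_correspondences` and the index-based cluster representation are assumptions of the sketch.

```python
def cluster_correspondences(corrs, w, threshold):
    """Grow one cluster per seed correspondence.

    corrs     -- list of correspondences (any objects accepted by w)
    w         -- symmetric pairwise consistency measure, small = consistent
    threshold -- a candidate joins a cluster only if its measure is below this
    Returns a list of clusters, each a list of indices into corrs.
    """
    clusters = []
    for seed in range(len(corrs)):
        group = [seed]
        remaining = [j for j in range(len(corrs)) if j != seed]

        def cost(j):
            # Consistency with a cluster is the worst (largest) pairwise value.
            return max(w(corrs[j], corrs[i]) for i in group)

        while remaining:
            best = min(remaining, key=cost)
            if cost(best) >= threshold:
                break  # nothing else is consistent with the whole cluster
            group.append(best)
            remaining.remove(best)
        clusters.append(group)
    return clusters
```

Because every correspondence seeds its own cluster, the same correspondence may appear in several clusters, which matches the behavior the text relies on for symmetric models.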

A plausible transformation $T$ from model to scene
is calculated from each cluster $\{[s_i, m_i]\}$ of
correspondences by minimizing

\[ E_T = \sum_i \| s_i - T(m_i) \|^2 \tag{7} \]

Instead  of  using  points  and  normals  in  the  minimi¬ 
zation  of  (7),  we  use  only  the  position  of  the  ori¬ 
ented  points.  This  allows  us  to  use  a  well  defined 
algorithm  for  finding  the  best  rigid  transformation 
that  aligns  two  point  sets  [6].  The  transformations 
and  associated  correspondence  clusters  are  then 
input  into  a  verification  procedure. 

7.  Verification 

The  purpose  of  verification  is  to  find  the  best  trans¬ 
formation  of  model  to  scene  by  eliminating 
matches  that  are  inconsistent  when  all  of  the  scene 
data  is  compared  to  all  of  the  model  data.  Our  veri¬ 
fication  algorithm  is  a  formulation  of  the  iterative 


Figure  8:  During  verification,  initial 
correspondences  are  spread  over  two  views  (one 
wireframe,  the  other  shaded).  Correspondences  are 
prevented  from  being  established  outside  the 
overlap  of  the  views. 

closest  point  algorithm  [1]  [16]  that  can  handle 
partially overlapping point sets and arbitrary transformations
because it is initialized with a transformation
generated from correspondences determined
by  matching  of  spin-images. 

Verification  starts  with  an  initial  set  of  point  corre¬ 
spondences  from  which  the  transformation  of 
model  to  scene  is  computed  and  then  applied  to  the 
model  points.  For  each  correspondence,  new  corre¬ 
spondences  are  established  between  the  nearest 
neighbors  of  the  model  point  and  nearest  neighbors 
of  the  corresponding  scene  point  if  the  distance 
between closest points is less than a threshold $D_v$.
By finding scene points that are close to model
points,  this  step  grows  the  correspondences  from 
those  correspondences  already  established.  The 
transformation  based  on  the  new  correspondences 
is  computed  and  then  refined  using  traditional  ICP. 
The  growing  process  is  repeated  until  no  more  cor¬ 
respondences  can  be  established. 

The threshold $D_v$ in the verification stage (which sets
the  maximum  distance  by  which  two  points  can 
differ  and  still  be  brought  into  correspondence)  is 
set  automatically  to  two  times  the  resolution  of  the 
meshes.  This  threshold  allows  for  noise  but  pre¬ 
vents  establishment  of  correspondences  in  regions 
where  the  data  sets  do  not  overlap. 

Our  algorithm  grows  patches  of  correspondences 
between  the  model  and  the  scene  from  the  initial 
correspondences  and  a  cascade  effect  occurs.  If  the 
transformation  is  good,  a  large  number  of  points 
will be brought into correspondence; if the transformation
is bad, the number of correspondences
will  remain  close  to  the  original  number.  There¬ 
fore,  a  good  measure  of  the  validity  of  the  match  is 
the  number  of  correspondences  after  verification. 
Since  a  large  number  of  correspondences  are  used 
to  compute  the  final  transformation,  the  alignment 
will  be  more  accurate  than  that  computed  through 
matching  of  spin-images. 

Figure 8 illustrates how initial correspondences,
established  by  matching  spin-images,  are  spread 
over  the  surfaces  of  two  range  views  of  a  plastic 
model  of  the  head  of  the  goddess  Venus.  The  corre¬ 
spondences  are  established  only  in  the  regions 
where  the  two  surface  meshes  overlap,  thus  pre¬ 
venting  a  poor  registration  caused  by  correspon¬ 
dences  being  established  between  non-overlapping 
regions. 

8.  Multiple  Models 

A  strong  property  of  our  recognition  algorithm  is 
that  it  permits  simultaneous  recognition  of  multiple 
models.  Recognition  with  multiple  models  is  simi¬ 
lar  to  recognition  with  one  model  except  that  each 
scene  point  is  compared  to  the  spin-images  of  all  of 
the  models.  The  rest  of  the  algorithm  is  the  same 
except  that  correspondences  with  model  points 
from  different  models  are  prevented  from  being 
clustered. 

Figure  9  demonstrates  the  simultaneous  recogni¬ 
tion  of  two  models  of  free-form  shape.  The  femur 
and pelvis models were acquired using a CT scan of


the  bones,  followed  by  surface  mesh  generation 
from  contours.  The  scene  was  acquired  using  a 
structured  light  range  camera.  Both  the  mod¬ 
els  and  scene  data  were  processed  by  removing 
long  edges  associated  with  step  discontinuities, 
applying  a  “smoothing  without  shrinking”  filter 
[14],  and  then  applying  a  mesh  simplification 
algorithm  that  preserves  the  shape  of  objects  in  the 
scene while evenly distributing the points over
their surfaces [8]. Our algorithm was able to recognize
the objects even in the presence of extreme
occlusion;  only  20%  of  the  surface  of  the  pelvis  is 
visible.  This  result  also  demonstrates  the  recogni¬ 
tion  of  objects  sensed  with  different  sensing 
modalities. 

9.  Results 

Our  main  application  domain  is  interior  modeling. 
In  interior  modeling,  objects  are  recognized  in 
range  images  of  complex  industrial  interiors.  By 
recognizing  objects,  a  semantic  meaning  is  associ¬ 
ated  with  the  objects  in  the  scene,  setting  the  stage 
for  high-level  robotic  interaction.  For  example,  by 
recognizing  a  valve  in  the  scene,  a  robot  can  be 
given high-level commands such as “turn off the
valve”  [9]. 

Figure  10  shows  the  result  of  recognizing  four  dif¬ 
ferent  industrial  objects  in  cluttered  industrial 
scenes.  The  surface  mesh  models  were  generated 
from CAD drawings using finite element software to
tessellate  the  surface  of  the  objects.  The  scene 
images  were  acquired  with  a  Perceptron  5000 
scanning  laser  rangefinder.  Before  recognition,  the 




Figure  9:  The  simultaneous  recognition  of  the  models  of  a  femur  and  a  pelvis  bone  in  a  range  image.  The 
femur  and  pelvis  are  recognized  even  in  the  presence  of  extreme  occlusion  (only  20%  of  the  pelvis  surface 
is  visible)  and  clutter.  From  left  to  right  are  shown  the  pelvis  and  femur  models  acquired  with  CT,  the 
scene image from a structured light range camera, and then two views of the recognition results where the
scene  data  is  shaded  and  the  models  are  in  wireframe.  This  result  demonstrates  simultaneous  recognition 
of  free-form  objects  acquired  using  different  sensing  modalities. 
Figure 10: The recognition of industrial objects in complex scenes. On the left are shown wireframe
models  which  were  created  from  CAD  drawings  using  finite  element  software  for  surface  tessellation.  In 
the  middle  are  shown  the  intensity  images  acquired  when  a  scanning  laser  rangefinder  imaged  the  scene. 
On  the  right  are  shown  the  recognized  models  (shaded)  superimposed  on  the  scene  data  (wireframe).  These 
results  demonstrate  the  recognition  of  complicated  symmetric  objects  in  3-D  scene  data  containing 
extreme clutter and occlusions. All of the scene data points are used in the recognition and no
object/background segmentation is performed.
scene data is processed to remove long edges and
small  surface  patches,  smoothed  and  simplified.  In 
all  examples,  the  scene  data  is  complex  with  a 
great  deal  of  clutter.  Furthermore,  all  the  models 
exhibit  symmetry  which  makes  the  recognition 
more  difficult,  because  a  single  scene  point  can 
match  multiple  model  points. 

In  addition  to  these  results,  we  have  generated 
results from multi-view merging and alignment of
terrain  maps  [7]. 

10.  Future  Work 

We  have  presented  a  recognition  algorithm  that  is 
based on matching spin-images generated using oriented
points on the surface of an object. Although
spin-images  are  global  representations,  they  are 
robust  to  clutter  and  occlusions.  We  have  demon¬ 
strated  our  algorithm’s  ability  to  handle  clutter  and 
occlusions  through  recognition  of  objects  in  com¬ 
plex  cluttered  scenes. 

In  the  future,  we  will  extend  the  algorithm  to  rec¬ 
ognize  multiple  objects  simultaneously  from  a 
library  of  models.  This  will  require  efficient  meth¬ 
ods  for  determining  which  models  in  the  library  are 
present  in  the  scene.  We  have  determined  that  all  of 
the  spin-images  for  a  model  lie  on  a  2-D  manifold 
in  the  high-dimensional  spin-image  space.  It  is 
likely  that,  through  the  use  of  eigen-space  com¬ 
pression,  this  manifold  can  be  projected  onto  a 
much lower dimensional manifold. Rapid determination
of which models appear in the scene could
then be achieved by projecting scene spin-images
onto a manifold generated for each model
in  the  library.  This  eigen-spin-image  approach  is 
similar  to  parametric  appearance  based  matching. 
Abstract

A  new  computational  approach  to  the  concept  of  me¬ 
dialness  and  skeletonization  of  a  grey  level  image  set  is 
presented,  motivated  by  the  original  notion  of  the  Blum 
“Grassfire”  Transform  definition  for  the  skeleton  of  a  bi¬ 
nary  set.  A  dynamic  programming  algorithm  generates  a 
unit  vector  field  in  the  interior  of  an  image  set  surrounded 
by  a  boundary  curve  whose  integral  curves  are  the  tra¬ 
jectories  of  the  contracting  boundary  curve  points.  The 
positive  divergence  of  this  unit  vector  field  is  used  as  a 
measure of “medialness”, while negative divergence of the
vector field with the same direction but whose length is the
local grey level gradient measures “boundariness”. Presented
are preliminary results for generating medialness,
skeletons  and  using  these  for  registration  and  measuring 
shape  similarity. 

1  Introduction 

The  Medial  Axis  Transform  has  enjoyed  a  long  and 
rich  history  in  the  field  of  computer  vision  [2,  3,  4,  6,  5] 
particularly  related  to  object  representation  in  images. 
A  number  of  definitions  have  been  given  for  essentially 
the  same  transformation  of  an  image  set  into  its  skeleton. 
One  of  the  original  definitions  for  the  medial  axis  trans¬ 
form  [1,  7]  reduces  an  object  in  a  binary  image  into  a  set 
of interior points, each of which is at the center of a metric
contour (e.g., a sphere for the Euclidean metric) wholly
contained  in  the  object  and  intersecting  the  boundary  of 
the  object  in  two  or  more  locations.  Such  a  construction 
has been extended to grey level images and made robust
to  noise  by  using  a  multi-scale  approach  called  “Cores” 
[6,  5].  Effectively  the  notion  of  medialness  at  a  point  in 
a  grey  level  image  at  a  given  scale  is  the  summation  of 
edge  strength  along  the  metric  contour  centered  at  this 
point  and  whose  radius  is  at  the  given  scale:  the  “Core” 
is  defined  in  terms  of  ridges  of  medialness  in  scale-space. 

We  propose  a  complementary  approach  to  the  no¬ 
tion  of  medialness  which  can  provide  augmented  utility 
particularly  for  object  registration  and  determination  of 
shape  similarity.  This  paper  presents  preliminary  ideas 
and results in this direction. The intuitive basis for medialness
that we present goes back to the “grassfire technique”,
directly quoting Blum and Nagel [4]:

Imagine  an  object  whose  border  is  set  on  fire.  The 
subsequent  internal  quench  points  of  the  fire  represent 
the  symmetric  [medial]  axis,  the  time  of  quench  for  unit 
velocity  propagation  being  the  radius  function. 

We  extend  this  concept  mathematically  by  assigning 
to each point in the object the unit vector direction in which
the fire is propagating as it passes over the point. Thus
as  the  fire  burns  it  defines  a  unit  vector  field  across  the 
object.  For  a  uniform  object  the  integral  curves  for  this 
vector field are simply the straight-line trajectories of the
fire’s  progression  that  eventually  run  into  each  other  at 
the  quench  points.  At  the  quench  points  unit  vectors  are 
discontinuously  changing  in  direction:  hence  for  us  the 
medial  axis  arises  from  singularities  in  this  unit  vector 
field.  To  go  further,  we  study  the  measure  of  medial¬ 
ness  from  the  computation  of  divergence  of  this  vector- 
field,  and  ridge  lines  of  divergence  form  our  notion  of  the 
skeleton  of  an  object. 

The  extension  to  grey  level  images  follows  naturally 
by interpreting grey value as density, where higher grey
values more strongly impede the progress of the fire from the
boundary  towards  the  interior  of  the  object.  This  leads  to 
unit  vectors  that  trace  out  paths  of  minimal  integrated 
density  from  the  boundary  to  the  interior.  In  general 
the  boundary  from  which  the  vectorfield  starts  need  not 
be that of an object; it can be the border of an arbitrarily
shaped  aperture  placed  over  an  image  region.  The  in¬ 
tegrated  density  (i.e.,  integrated  grey  value)  defines  a 
height  function  inside  the  aperture,  and  the  unit  vec¬ 
tor  field  of  interest  is  parallel  to  the  gradient  vector  for 
this  height  function.  In  the  special  case  of  an  aperture 
whose  border  is  aligned  with  the  boundary  of  a  uniform 
grey  object  this  height  function  is  simply  the  value  of  the 
Euclidean  Distance  Transform.  By  computing  the  diver¬ 
gence  of  the  gradient  vectorfield  of  the  height  function 
(as  opposed  to  the  unit  vector  field)  boundary  informa¬ 
tion  gets  computed  from  changing  grey  level  information. 
In  combination  with  medialness  information  from  the  di¬ 
vergence  of  the  unit  vector  field,  this  provides  important 
information  for  distinguishing  exterior/interior  of  objects 
in  a  grey  level  image. 

We  show  various  examples  of  computing  divergence 
for  vectorfields  for  grey  level  objects  demonstrating  the 
computation  of  medialness  and  in  some  cases  boundaries. 
We  then  show  how  this  vectorfield  can  be  used  for  sim¬ 
ilarity  matching,  and  how  the  singularities  of  the  vec¬ 
torfield  (i.e.,  ridges  of  divergence  forming  the  skeleton) 
can  be  used  to  perform  an  initial  coarse  object  registra¬ 
tion,  with  the  rest  of  the  vectorfield  used  to  refine  this 
registration. 

2  Some  Basic  Examples 

Given  a  density  image  and  an  aperture,  a  height  func¬ 
tion  can  be  defined  at  each  point  interior  to  the  aperture 
as the integrated density (ID) along the path from the
Figure 1: (a) Top image shows uniform grey level rectangle.
(b) Left image shows interior vectorfield. (c) Right
image shows divergence of interior vectorfield (middle
grey = 0 divergence, light grey = positive divergence,
dark grey = negative divergence).


point  to  the  boundary  of  the  aperture  with  minimum  ID. 
We  generate  both  this  height  function  and  its  gradient  at 
each  location.  In  the  discrete  case,  this  is  calculated  for 
each  pixel  using  a  dynamic  programming  algorithm.  At 
each  point,  the  vector  to  an  8  neighbor  is  chosen  which 
minimizes  the  ID  of  this  vector  plus  the  height  at  the 
neighbor  point.  Using  a  priority  queue  sorted  by  height, 
these  values  are  calculated  in  0(n  logn)  time. 
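The priority-queue computation of the height function can be sketched as a Dijkstra-style propagation inward from the aperture boundary. This is an illustrative sketch only: the function name `height_and_vectors`, the boolean `inside` mask, and the choice to charge each step its Euclidean length times the density of the pixel being entered are assumptions, not details given in the text.

```python
import heapq
import math

def height_and_vectors(density, inside):
    """density: 2-D list of floats; inside: 2-D list of bools (True = interior).

    Returns (height, vector), where vector[r][c] is a unit (d_row, d_col) step
    toward the 8-neighbor that minimizes step cost plus neighbor height.
    Assumes the interior is surrounded by at least one non-interior pixel.
    """
    rows, cols = len(density), len(density[0])
    height = [[math.inf] * cols for _ in range(rows)]
    vector = [[(0.0, 0.0)] * cols for _ in range(rows)]
    steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    heap = []
    # Seed the queue: every pixel outside the aperture has height 0.
    for r in range(rows):
        for c in range(cols):
            if not inside[r][c]:
                height[r][c] = 0.0
                heapq.heappush(heap, (0.0, r, c))
    while heap:
        h, r, c = heapq.heappop(heap)
        if h > height[r][c]:
            continue  # stale queue entry
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and inside[nr][nc]:
                length = math.hypot(dr, dc)
                cand = h + length * density[nr][nc]  # assumed step cost
                if cand < height[nr][nc]:
                    height[nr][nc] = cand
                    # Vector at (nr, nc) points back toward (r, c).
                    vector[nr][nc] = (-dr / length, -dc / length)
                    heapq.heappush(heap, (cand, nr, nc))
    return height, vector
```

With a uniform density of 1, the height reduces to a chamfer-style approximation of the distance to the aperture boundary, which matches the remark that the uniform case yields a distance transform.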

Figure 1a shows a uniform grey level rectangle against
a dark background. Figure 1b shows the unit vector field
generated from the inward propagation of a fire set at the
boundary of the rectangle, using the dynamic programming
algorithm described above. Figure 1c shows a grey
level representation of the local divergence of the vectorfield
in Figure 1b, obtained by performing convolution with the
3x3 kernel shown in Figure 2. The convolution is computed
by summing the dot products of each vector in the kernel
with the vector in the image it overlays. The brighter grey
values in Figure 1c show the ridges of divergence values
depicting the skeleton. Unlike a binary skeleton, the
divergence value at each point on the skeleton gives the
“medialness” at that skeleton point.

Figure 2: Convolution kernel for divergence computation.
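The kernel-based divergence computation can be sketched as follows. The kernel of Figure 2 is not reproduced in the text, so the kernel used here (unit vectors pointing from the centre to each of the eight neighbours, with a zero centre) is an assumption, as is the function name `divergence`.

```python
import math

def divergence(field):
    """field: 2-D list of (v_row, v_col) vectors. Returns a 2-D list of floats.

    Each output value is the sum of dot products between the assumed kernel
    vectors and the field vectors they overlay; border pixels are left at 0.
    """
    rows, cols = len(field), len(field[0])
    kernel = {}
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0):
                n = math.hypot(dr, dc)
                kernel[(dr, dc)] = (dr / n, dc / n)  # outward unit vector
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            total = 0.0
            for (dr, dc), (kr, kc) in kernel.items():
                vr, vc = field[r + dr][c + dc]
                total += kr * vr + kc * vc
            out[r][c] = total
    return out
```

Under this sign convention a field whose vectors point away from a pixel on all sides gives a large positive value there, which is consistent with the bright skeletal ridges described for Figure 1c.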

Figure 3a shows a grey level rectangle with a small triangle
appended to one of its sides. Figure 3c shows the
resulting divergence image and the additional appendage
to the skeletal ridge as a result of the additional triangle.
While the skeleton has an additional segment, the
divergence (i.e., “medialness”) along this ridge is much
smaller farther away from the triangle. If instead a triangle
of smaller size were appended (e.g., on the order of the
size of noise in the shape of the boundary of the rectangle),
the divergence would become negligible except in the close
vicinity of the noise. In effect, for noise along the boundary
shape, the main skeletal ridge for the rectangular
shape remains the same, with boundary noise producing
disconnected skeletal ridges only in the spatial vicinity
of the noise.

Figure 3: (a) Top image shows uniform grey level rectangle
with triangle appended to bottom edge, (b) Left
image shows interior vectorfield, (c) Right image shows
divergence of interior vectorfield (middle grey = 0 divergence,
light grey = positive divergence, dark grey =
negative divergence).

Figure  4a  shows  a  rectangle  composed  of  two  rectan¬ 
gles  each  of  two  significantly  different  uniform  grey  val¬ 
ues.  Figure  4b  shows  the  interior  vectorfield  generated 
for  an  aperture  surrounding  both  rectangles  together. 
Figure  4c  shows  that  the  resulting  computation  of  di¬ 
vergence  creates  two  disconnected  ridges  of  positive  di¬ 
vergence.  Separating  these  two  ridges  of  positive  diver¬ 
gence  is  a  dark  “boundary”  of  negative  divergence.  Al¬ 
though  preliminary,  this  phenomenon  shows  how  objects 
of  different  grey  value  composition  can  be  potentially  seg¬ 
mented  by  looking  at  ridges  of  positive  and  negative  di¬ 
vergence. 

The  left  of  Figure  5  shows  a  grey  level  image  for  a  set 
of  objects,  and  the  right  of  Figure  5  shows  divergence 
computations  for  the  interiors  of  each  object  with  aper¬ 
ture  respectively  being  at  the  boundary  for  each  object. 
The  next  section  shows  an  example  of  registration  be¬ 
tween  two  objects  in  these  images  (i.e.,  the  rectangular 
shapes with three holes) using the skeletal ridge and interior
vector field.

Figure 4: (a) Top image shows a rectangle composed of two
rectangles of significantly different uniform grey values,
(b) Left image shows interior vectorfield, (c) Right image
shows divergence of interior vectorfield (middle grey = 0
divergence, light grey = positive divergence, dark grey =
negative divergence).

Figure 5: (Left), a grey level image of an assortment of shapes, (Right), divergence computed for the interior
vectorfield of each grey level object.


3  Registration  and  Similarity  Between 
Objects 

Figure  6  and  Figure  7  show  closeups  of  the  objects 
and  respective  interior  vectorfields  and  divergence  com¬ 
putations,  to  be  registered. 

Registration  of  two  objects  is  achieved  in  four  steps. 
First,  the  skeletons  of  images  are  computed,  taken  as  the 
largest  connected  set  of  ridges  in  the  images.  Each  of 
these  is  then  collapsed  into  a  graph  whose  nodes  have 
degree  !=  2.  Associated  with  each  connected  pair  of 
nodes  is  a  curve  of  the  skeleton,  represented  as  a  set 


of  points.  Any  terminal  curve  (ending  with  node  degree 
=  1)  whose  average  divergence  is  not  above  a  thresh¬ 
old  set  as  a  percentage  of  the  average  divergence  of  all 
of  the  skeleton  curves  is  removed  and  the  skeleton  fur¬ 
ther  collapsed.  A  set  of  topology-preserving  matchings 
between  the  set  of  nodes  for  each  is  generated  in  a  brute- 
force  manner.  The  match  with  minimum  squared  dif¬ 
ference  between  the  lengths  of  matched  curves  is  then 
chosen.  The  rigid  transformation,  consisting  of  a  combi¬ 
nation  of  2-D  translation  and  rotation,  which  minimizes 
the  squared  distances  between  matched  nodes  is  calcu¬ 
lated.  This  transformation  is  applied  to  all  further  posi¬ 
tions  and  vectors  from  the  first  image. 
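The coarse registration step, a 2-D rotation plus translation minimizing the squared distances between matched nodes, has a standard closed-form least-squares solution. The sketch below shows that closed form; the function name `fit_rigid_2d` and the point-list representation are assumptions, and the text does not say which solver the authors used.

```python
import math

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst.

    src, dst: equal-length lists of matched (x, y) points.
    Returns (theta, (tx, ty)) such that R(theta) * p + t approximates q.
    """
    n = len(src)
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(q[0] for q in dst) / n
    cdy = sum(q[1] for q in dst) / n
    cos_sum = 0.0  # sum of dot products of centred pairs
    sin_sum = 0.0  # sum of cross products of centred pairs
    for (x, y), (u, v) in zip(src, dst):
        x, y, u, v = x - csx, y - csy, u - cdx, v - cdy
        cos_sum += x * u + y * v
        sin_sum += x * v - y * u
    theta = math.atan2(sin_sum, cos_sum)
    c, s = math.cos(theta), math.sin(theta)
    # Translation carries the rotated source centroid onto the target centroid.
    tx = cdx - (c * csx - s * csy)
    ty = cdy - (s * csx + c * csy)
    return theta, (tx, ty)
```

Applying the returned rotation and translation to every position (and the rotation alone to every vector) of the first image gives the rigidly aligned data used by the subsequent refinement.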

Next,  after  this  initial  coarse  registration,  the  matched 
curves  are  then  parameterized  by  integrated  divergence. 
Each  pixel  along  the  curve  in  the  image  to  be  transformed 
is  matched  with  a  location  on  the  associated  curve  in  the 
other. This location may be non-integral, which is acceptable
because we will be calculating the transformation for each
pixel only in the second image. The pixels in the second
image can be partitioned into two sets, based on whether
or not they have a vector incident on them, i.e., whether or
not any other pixel’s vector points to them. In addition
to  the  pixels  on  the  skeleton,  all  adjacent  pixels  which  do 
not  have  vectors  incident  on  them  are  also  mapped,  ad¬ 
justed  by  distance  from  the  skeleton  point  scaled  by  the 
ratio  of  the  divergences,  normalized  by  total  divergence 
along  the  curve.  Then,  for  each  matched  point  on  or  near 
the  skeletons,  the  path  in  the  vector  field  for  which  it  is  a 
source  is  followed,  stepping  by  pixel  in  the  second  image. 
At  each  step,  the  previous  matched  location  in  the  origi¬ 
nal  image  is  modified  by  its  current,  interpolated  vector, 
scaled  by  the  ratio  of  the  vector  lengths  in  the  two  images 
normalized  by  the  total  path  length.  Finally,  the  rest  of 
the  non-target  set  of  pixels  is  taken  in  descending  order 
of  divergence.  For  each,  its  path  is  followed,  finding  the 
corresponding  locations  until,  either  an  already  mapped 
pixel  is  found,  or  the  boundary  is  reached.  After  all  of 
the  points  in  the  non-incident  set  have  been  investigated, 
all pixels in the second image have associated with them
a location in the first, as well as a vector. The locations
represent the non-rigid residual transformation after the
rigid transformation is applied.

Figure 6: Left rectangle shape with 3 holes from Figure 5.

Figure 7: Right rectangle shape with 3 holes from Figure 5.

Two proposed measures for accuracy of registration,
as well as shape similarity, using the interior vectorfields
for two objects, are as follows:

$$\frac{1}{\|obj_1 \cap obj_2\|} \sum_{obj_1 \cap obj_2} \left( V_{obj_1} \cdot V_{obj_2} \right)^2$$

$$\frac{1}{\|obj_1 \cap obj_2\|} \sum_{obj_1 \cap obj_2} \left\| V_{obj_1} - V_{obj_2} \right\|^2 .$$

4 Conclusion

This paper has presented preliminary research using
a complementary approach to the computation of medialness
using divergence of an interior vectorfield to
an object. Some basic examples of these computations
were shown, and methods for registration were outlined.
Measures for testing the accuracy of registration, as well
as shape similarity, were proposed.

Abstract

We  developed  an  algorithm  that  achieves  good  and  ro¬ 
bust  matching  between  two  contours  of  possibly  rather 
different  shapes.  We  describe  extensive  experiments  with 
real images of 3D objects, and demonstrate excellent
matching  results  for  curves  under  partial  occlusion,  for 
curves  describing  the  same  object  observed  from  differ¬ 
ent  viewpoints,  and  for  curves  describing  different  but 
related  objects.  Furthermore,  we  used  the  method  to 
compare  a  range  of  curves  taken  from  a  large  database 
of  real  images  of  various  toy  models.  We  then  used  the 
results  to  classify  the  data  with  an  automatic  hierarchi¬ 
cal  clustering  algorithm,  getting  excellent  results  which 
faithfully  captured  the  real  structure  in  the  data.  This 
serves as indirect evidence for the quality of our matching
algorithm. 

1  Introduction 

We  discuss  the  problem  of  contour  matching  in  the  con¬ 
text  of  learning  from  visual  examples.  In  a  nutshell,  we 
wish  to  reveal  the  structure  which  underlies  a  large  collec¬ 
tion  of  images  and  shapes;  for  this  purpose  we  shall  use 
hierarchical  clustering,  where  shapes  are  hierarchically 
categorized  according  to  their  mutual  similarities.  Thus 
we  need  to  accomplish  reasonable  matching  between 
weakly  similar  curves.  This  is  unlike  typical  model  based 
recognition  applications,  where  one  need  only  determine 
whether  the  model  curve  and  the  image  curve  are  the 
same,  up  to  an  image  transformation  (e.g.,  translation, 
rotation and scaling) and some permitted level of noise [1;
4; 5; 6]. The desired qualitative matching is also unlike
other  applications  of  curve  matching,  like  stereo  or  track¬ 
ing  [3]. 

We  do  not  have  a  precise  definition  for  what  amounts 
to  a  reasonable  matching  between  weakly  similar  curves. 
As  an  illustrative  example,  consider  two  curves  describ¬ 
ing  the  shape  of  two  different  mammals:  possibly  we 
would  like  to  see  their  limbs  and  head  correspondingly 
matched.  Thus  it  appears  that  we  would  like  to  match 
curve  sections  that  are  locally  similar.  We  will  therefore 
define  a  suitable  heuristic  measure  for  local  similarity, 
and  we  will  propose  a  matching  algorithm  that  maxi¬ 


mizes  this  measure. 

There  is  a  fundamental  difference  between  local  and 
global  methods  of  curve  matching.  Our  method  is  purely 
local,  since  we  would  like  to  apply  it  also  in  cases  of  occlu¬ 
sions,  where  partial  matching  is  desired.  Moreover,  while 
many  available  local  methods  assume  that  the  whole  vis¬ 
ible  image  curve  can  be  matched  with  some  portion  of  a 
pre-stored  model  curve,  we  avoid  this  assumption,  which 
is  justified  for  model  based  recognition  applications,  but 
not  in  the  present  study.  Instead,  our  problem  is  to  find 
subsections on two given curves, which are “sufficiently”
similar. 

We are motivated by an application that lacks a priori
knowledge, as in [7]. However, a great difference between
the present work and [7] is that the latter method is not
local: it involves using the total length of the contour.

In  the  present  work  we  use  a  segment  representation, 
which  is  a  polygonal  approximation  of  the  (closed)  curve. 
Segments  are  matched  in  a  fine-to-coarse  multi-scaled 
approach.  More  specifically,  we  try  to  match  the  given 
curves  at  their  maximal  available  resolution,  but  we  al¬ 
low  for  segment  merging  which  is  expected  to  remove 
the  noise  and  fine  resolution  details.  We  transform  the 
similarity  optimization  problem  to  a  set  of  dynamic  pro¬ 
gramming  problems,  which  compete  for  the  best  solution. 

In  the  next  section  we  describe  our  matching  algo¬ 
rithm.  In  section  3  we  show  matching  results,  and  we  also 
show an application to image classification. The perfect
image clustering that is obtained stands as evidence
for the quality of our matching results.

2  The  matching  algorithm 

Let us denote by $A = \{a_i\}$ the sequence of segments
which represent a linear approximation of a curve. Each
segment is defined by its length $\ell(a_i)$ and slope $\theta(a_i)$.
We also define a special symbol $\ominus$ that can be inserted
between two segments. The symbol $\ominus$ represents a
discontinuity in the contour, or a “gap segment”; it does
not have length or slope characteristics.

Image scaling is expressed in this representation by
$\ell(a_i) \to c \cdot \ell(a_i)\ \forall i$ and image rotation is expressed by
$\theta(a_i) \to c + \theta(a_i)\ \forall i$.

In addition to these two operations on $A$, we define the
following operations as well:

• Interruption: the operation of inserting a gap element
$\ominus$ into $A$.

• Merge: the operation of replacing $m$ consecutive elements,
which do not contain $\ominus$, by a single new
element, that represents their vectorial sum.

• Shift: the operation of (cyclic) re-ordering, i.e.,
$a_i \leftarrow a_{i+1}\ \forall i$. For a closed curve with $n$ segments
$a_n \leftarrow a_1$, and for an open curve $i$ can be zero or negative.

For example, the set $\{a_1, a_2, a_3, a_4, a_5, a_6\}$, which
represents a closed curve, may be transformed to
$\{a_3, a_{4+5}, a_6, a_1, \ominus, \ominus, a_2\}$ by two shift operations, one
merge operation and two interruptions.
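The three list operations and the worked example above can be sketched in a few lines. The representation is an assumption made for illustration: segments are `(dx, dy)` vectors, the gap symbol is `None`, and the function names `shift`, `merge` and `interrupt` are hypothetical.

```python
GAP = None  # stands for the gap symbol

def shift(A, k=1):
    """Cyclic re-ordering of a closed curve by k positions."""
    k %= len(A)
    return A[k:] + A[:k]

def merge(A, start, m):
    """Replace m consecutive non-gap segments by their vector sum."""
    group = A[start:start + m]
    if any(seg is GAP for seg in group):
        raise ValueError("cannot merge across a gap element")
    merged = (sum(dx for dx, _ in group), sum(dy for _, dy in group))
    return A[:start] + [merged] + A[start + m:]

def interrupt(A, pos):
    """Insert one gap element before position pos."""
    return A[:pos] + [GAP] + A[pos:]

# Six segments of a closed curve (the vectors sum to zero).
a1, a2, a3, a4, a5, a6 = (1, 0), (1, 1), (0, 1), (-1, 0), (-1, -1), (0, -1)
A = [a1, a2, a3, a4, a5, a6]

B = shift(shift(A))   # two shifts:         a3 a4 a5 a6 a1 a2
B = merge(B, 1, 2)    # one merge:          a3 a4+5 a6 a1 a2
B = interrupt(B, 4)   # first interruption
B = interrupt(B, 4)   # second:             a3 a4+5 a6 a1 gap gap a2
```

Storing segments as vectors rather than (length, slope) pairs makes the merge a plain component-wise sum; length and slope can be recovered from the vector when the similarity is evaluated.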

Next we define a measure for local similarity between
contours. Let $A$ and $B$ represent two given contours.
Then:

$$S(A,B) = \sum_{i=n_1}^{n_2} s(a_i, b_i) \qquad (1)$$

where $n_1 < n_2$ are valid indices in $A$ and $B$, and $s(a_i, b_i)$
is a pairwise segment similarity function to be defined.

Given two contours which are represented by $A$ and $B$,
the goal of the matching algorithm is to transform them
into $A'$ and $B'$ for which the local similarity $S(A', B')$ is
maximal.


Figure 1: Matching of two sketches, drawn by hand. On the
right: the alignment transformation which minimizes two thirds
of the distances between the matched features (see text).


The motivation behind our formulation may be clarified
with a simple example. Figure 1 shows two hand-drawn
contours that are matched using our method. Segments
$b_5$ - $b_7$ are merged into one segment (denoted $b'_i$),
which is in turn matched to segment $a_3$ (denoted $a'_i$).
Hence, in this case, the merging mechanism is used to
compensate for noise or scale (resolution) differences. On
the other hand, the gap mechanism ($\ominus$ insertion) is used
to compensate for a long occluded sequence of features.
For example, segment $b_{15}$ lies within such a sequence,
and it is therefore matched with a gap $\ominus$.

We will now discuss the similarity measure in more
detail. The segment similarity function $s(a, b)$ from Eq.
(1) is defined as

$$s(a,b) = w_1\, s_\ell(\ell(a), \ell(b)) + s_\theta(\theta(a), \theta(b)) \qquad (2)$$

$$s(a,\ominus) = s(\ominus,b) = w_2 \qquad (3)$$

where $a, b \neq \ominus$, and $s(\ominus,\ominus)$ is not defined. In (2) we
compute separately the scale similarity and orientation
similarity of two segments $a$ and $b$; we then let them
contribute additively to the total segment similarity, with
a relative weight of $w_1$. This choice is somewhat arbitrary,
although it appeals to our intuition. Nevertheless,
whichever way $s(a, b)$ is chosen, we require that it
be symmetric, i.e., $s(a, b) = s(b, a)$. We also require that
$w_2$ in (3) be a constant.

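Our specific choices for $s_\ell$ and $s_\theta$ are the following:

$$s_\theta(a,b) = \cos\left(\theta(a) - \theta(b)\right) \qquad (4)$$

$$s_\ell(a,b) = \frac{2\,\ell(a)\,\ell(b)}{\ell^2(a) + \ell^2(b)} = \cos 2\delta \qquad (5)$$

where $\delta$ is the angle in between the vector $[\ell(a), \ell(b)]$
and the vector $[1,1]$.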
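Equations (2) to (5) translate directly into code. In the sketch below a segment is assumed to be a `(length, angle)` pair with the angle in radians, the gap symbol is `None`, and the names `s_len`, `s_ang` and `seg_sim` are hypothetical; the default weights are the values reported later in the results section.

```python
import math

GAP = None  # stands for the gap symbol

def s_len(la, lb):
    """Scale similarity, Eq. (5): 2*la*lb / (la^2 + lb^2)."""
    return 2.0 * la * lb / (la * la + lb * lb)

def s_ang(ta, tb):
    """Orientation similarity, Eq. (4): cosine of the slope difference."""
    return math.cos(ta - tb)

def seg_sim(a, b, w1=1.0, w2=0.8):
    """Segment similarity, Eqs. (2) and (3)."""
    if a is GAP and b is GAP:
        raise ValueError("similarity of two gaps is not defined")
    if a is GAP or b is GAP:
        return w2
    (la, ta), (lb, tb) = a, b
    return w1 * s_len(la, lb) + s_ang(ta, tb)
```

The identity in Eq. (5) can be checked numerically: for lengths 1 and 2 the ratio is 4/5, and the vector [1, 2] makes an angle of atan2(2, 1) - pi/4 with [1, 1], whose doubled cosine is also 4/5.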

The value of $s(a, \ominus)$ is not intended to measure the
similarity between a segment $a$ and a gap segment $\ominus$,
which is meaningless. Instead, $w_2$ is a matching threshold.
When $s(a, b) < w_2$ it might be preferable to insert
a gap element, and gain at least the value of $s(a, \ominus)$ or
$s(\ominus, b)$. Increasing the value of $w_2$ implies that a higher
level of similarity between matched segments is required,
and the length of non-interrupted matched sequences is
consequently reduced.

So far, the total contribution of interruptions is determined
only by the number of gap elements that have been
inserted. However, the case of a few consecutive gap elements
should lead to greater similarity as compared to
the case when the same number of gap elements is uniformly
spread along the contour. This is because the
first case could arise from occlusion, whereas the second
case arises typically from curve dissimilarity. To take
this factor into account, we modify our previous definition
of the contour similarity $S(A,B)$; we now add to the
pairwise segment sum a penalty term, which reduces the
similarity if the gap elements are not connected. Namely:

$$S(A,B) = \sum_{i=n_1}^{n_2} s(a_i, b_i) - N_g \cdot w_3 \qquad (6)$$

where $N_g$ is the number of connected chains of gap elements,
and $w_3$ is a weight factor.
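The effect of the chain penalty in Eq. (6) can be illustrated by scoring two already-aligned sequences. This is a sketch under assumptions: segments are `(length, angle)` pairs, the gap symbol is `None`, a chain is taken to be a maximal run of consecutive gaps within one sequence, and the names `pair_sim`, `count_gap_chains` and `contour_similarity` are hypothetical.

```python
import math

GAP = None  # stands for the gap symbol

def pair_sim(a, b, w1=1.0, w2=0.8):
    """Eqs. (2)-(3) for (length, angle) segments; a gap pair scores w2."""
    if a is GAP or b is GAP:
        return w2
    (la, ta), (lb, tb) = a, b
    return w1 * 2.0 * la * lb / (la * la + lb * lb) + math.cos(ta - tb)

def count_gap_chains(seq):
    """Number of maximal runs of consecutive gap elements in one sequence."""
    chains, prev_gap = 0, False
    for el in seq:
        is_gap = el is GAP
        if is_gap and not prev_gap:
            chains += 1
        prev_gap = is_gap
    return chains

def contour_similarity(A, B, w1=1.0, w2=0.8, w3=8.0):
    """Eq. (6): pairwise sum minus w3 per connected chain of gaps."""
    if len(A) != len(B):
        raise ValueError("aligned sequences must have equal length")
    total = sum(pair_sim(a, b, w1, w2) for a, b in zip(A, B))
    n_g = count_gap_chains(A) + count_gap_chains(B)
    return total - n_g * w3

seg = (1.0, 0.0)
together = contour_similarity([seg, seg, GAP, GAP, seg], [seg] * 5)
spread = contour_similarity([seg, GAP, seg, GAP, seg], [seg] * 5)
```

Both alignments contain two gap elements, but the contiguous placement is charged `w3` once and the scattered placement twice, so `together` scores higher than `spread`.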

In the maximization of (6) we distinguish between global
operations, which act on the whole curve (rotation, scaling
and segment enumeration shift), and local operations
(interruption and merge). The global parameters define a
three dimensional parameter space, over which the similarity
criterion (6) should be maximized. We take a quite
conventional approach, and seek an approximate solution
among the $N \cdot M$ initial alignments¹ that map some segment
of one curve to a segment of the other curve. Here
$N$ and $M$ are the number of segments of $A$ and $B$.
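For one fixed global alignment, the search over the local operations (merge and interruption) can be organised as a dynamic program over pairs of prefixes of the two segment lists. The sketch below is one possible arrangement under stated assumptions, not the authors' implementation: segments are 2-D `(dx, dy)` vectors, a merge is their vector sum, at most `K` segments are consumed per matching step, a run of `k` gap matches scores `w2*k - w3`, and the table starts from `R[0][0] = 0`. The names `seg_sim`, `vec_sum` and `match_table` are hypothetical.

```python
import math

def seg_sim(a, b, w1=1.0):
    """Eq. (2) for two segments given as (dx, dy) vectors."""
    la, lb = math.hypot(*a), math.hypot(*b)
    if la == 0.0 or lb == 0.0:
        return float("-inf")  # degenerate merged segment: never chosen
    s_len = 2.0 * la * lb / (la * la + lb * lb)
    s_ang = math.cos(math.atan2(a[1], a[0]) - math.atan2(b[1], b[0]))
    return w1 * s_len + s_ang

def vec_sum(segs, x, y):
    """Vector sum of segments x+1..y (1-based), i.e. segs[x:y]."""
    return (sum(s[0] for s in segs[x:y]), sum(s[1] for s in segs[x:y]))

def match_table(A, B, w1=1.0, w2=0.8, w3=8.0, K=4):
    """R[i][j]: best score matching the first i segments of A with the first j of B."""
    N, M = len(A), len(B)
    NEG = float("-inf")
    R = [[NEG] * (M + 1) for _ in range(N + 1)]
    R[0][0] = 0.0  # assumed boundary condition
    for i in range(N + 1):
        for j in range(M + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            # Extend a previous match by one (possibly merged) segment pair.
            for al in range(max(0, i - K + 1), i):
                for be in range(max(0, j - K + 1), j):
                    if (i - al) + (j - be) <= K:
                        s = seg_sim(vec_sum(A, al, i), vec_sum(B, be, j), w1)
                        best = max(best, R[al][be] + s)
            # Match a run of A segments against gaps inserted into B.
            for al in range(i):
                best = max(best, R[al][j] + w2 * (i - al) - w3)
            # Match a run of B segments against gaps inserted into A.
            for be in range(j):
                best = max(best, R[i][be] + w2 * (j - be) - w3)
            R[i][j] = best
    return R
```

As a sanity check, matching a unit square against a copy whose first side is split into two halves recovers the full score of 2 per side, because the two halves are merged before being compared.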

¹ There is another factor of 2 if reflections are allowed.

The choice of a particular initial alignment defines the
three global parameters. For each of the $N \cdot M$ initial
alignments we start the evaluation of (6), where the optimal
local operations are found by dynamic programming.
More specifically, for two given ordered sets $A$ and $B$,
which are globally aligned and which consist of $N$ and
$M$ elements respectively, we assign an array $R_{N \times M}$. The
entry $R[i,j]$ holds the maximal similarity that can be
achieved when the first $i$ elements of $A$ are matched with
the first $j$ elements of $B$. The entries of $R$ are updated
in a proper order according to the following rule:

$$R[i,j] = \max\{r_1, r_2, r_3\} \qquad (7)$$

where

$$r_1 = \max_{(\alpha,\beta) \in \Omega} \left\{ R[\alpha,\beta] + s(a_{\alpha i}, b_{\beta j}) \right\}$$

$$r_2 = \max_{0 \le \alpha < i} \left\{ R[\alpha,j] + w_2 \cdot (i - \alpha) - w_3 \right\}$$

$$r_3 = \max_{0 \le \beta < j} \left\{ R[i,\beta] + w_2 \cdot (j - \beta) - w_3 \right\}$$

$a_{xy}$ denotes the vectorial sum of the segments
$(x+1), \ldots, y$, and the domain $\Omega$ is defined below.

The term $r_1$ denotes the similarity score that can be
achieved by extending a previously obtained match. The
extension may consist of a single segment addition to
each curve, or the addition of a few segments which are
merged together. The total number of added segments is
bounded by a constant $K$, and therefore:

$$\Omega = \{\alpha, \beta \mid 0 \le \alpha < i,\ 0 \le \beta < j,\ (i-\alpha) + (j-\beta) \le K\}.$$

Thus the domain $\Omega$ contains no more than $K(K-1)/2$
elements.

The terms $r_2$ and $r_3$ in equation (7) provide an alternative
to continuous segment matching. The alternative
is to use the interruption mechanism, and match $k$ segments
of one curve with $\ominus$’s that are inserted into the
other curve. This operation can contribute to $R[i,j]$ only
the pre-defined quantity $w_2 \cdot k - w_3$, and hence it will be
chosen only in cases where the continuous matching is
poor and can contribute even less.

When $R$ is fully updated, the path in $R$ which arrives
at the entry with maximal score represents the optimal
segment correspondence between the two curves.

In order to avoid the computation of all the dynamic
programming arrays, we use two strategies. The first strategy
makes the update process competitive: We define a potential
function $f(R_i)$, which decreases monotonically during
the process, and which gives an upper bound on the values
that a given $R_i$ may achieve when the process ends.
In the field of AI, $f(R)$ is called an “optimistic heuristic
function”. The process is transformed to a competitive
one by choosing at each step a different array to be updated,
namely, the array $R_i$ for which $f(R_i)$ is maximal.
At the completion of the update step, the value of $f(R_i)$ is
decreased. Since $f(R)$ is based on optimistic estimation,
the optimal solution cannot be missed.

The second strategy, which we use to reduce unnecessary
computations, is statistical filtering. This strategy
is complementary to the competitive strategy, since the
competitive strategy is not effective, and even wasteful,
in the initial stages of the update process. Nevertheless,
after a few updating steps, we can already get a rough
estimate of the global parameters, and continue the competition
between the arrays that are not far from our
estimate.

3 Results

3.1  Pairwise  contour  matching 

Figure  2  demonstrates  the  local  nature  of  the  match¬ 
ing  method.  The  difference  in  length  between  the  out¬ 
lines  of  the  two  silhouettes,  which  in  this  example  re¬ 
sults  from  occlusion,  does  not  impede  the  correct  match¬ 
ing  of  the  common  parts.  This  is  due  to  the  fact  that 
our  method  does  not  require  global  image  normalization. 
Note  also  that  although  the  cow  in  the  two  images  was 
photographed  from  slightly  different  points  of  view  (note 
the  distance  between  the  front  legs  and  the  number  of 
ears),  the  matching  is  essentially  perfect. 

Wrong  matches  cannot  be  avoided,  but  they  are  usu¬ 
ally  easy  to  identify  since  they  typically  correspond  to 
large  distances  between  the  matched  features.  The  rea¬ 
son  behind  this  convenient  characteristic  of  errors  is  that 
the  matching  algorithm  depends  only  on  the  local  shape 
of  the  curve.  In  figure  2  the  mismatches  were  eliminated 
by  a  distance  threshold  (see  the  dashed  lines) . 


Figure  2:  The  alignment  transformation  which  minimizes  2/3 
of  the  distances  between  matched  features.  Four  of  the  ignored 
matches  are  denoted  by  dashed  lines.  There  were  64  and  40 
features  extracted  from  the  two  contours,  of  which  36  pairs  were 
matched. 

The  next  example  (figure  3)  illustrates  the  qualitative 
nature,  or  the  flexibility,  of  our  matching  method.  Recall 
that  our  intended  application  is  classification,  where  it  is 
necessary  to  compute  the  similarity  between  images  of 
different  objects. 

Figure 3: Qualitative match between different objects. Note the
correct correspondence between the feet of the wolf and the feet
of the horse, and the correspondence between the tails. Numbers
appearing on the image of only one object denote features that
were not matched (2, 26, 27 on the occluding contour of the wolf,
and 9, 46 on the horse).

The last example shows an application of the algorithm
to match images taken from very different points
of view (figure 4). The matching is successful if the two
silhouettes are similar enough (as 2D entities). Note that
preservation of shape under change of viewpoint is a quality
that defines “canonical views”. Using our similarity
function we can learn these views from examples.


Figure 4: Feature correspondence between two different views
of an object. Note the large foreshortening effect, that does not
destroy local similarity. Segment merging is seen at point 16 of
the left image, and points 7, 22, 34 of the right one.


We note that all the examples above (including the
one from figure 1) were generated with the same values
of parameters: $w_1 = 1$, $w_2 = 0.8$, $w_3 = 8.0$, $K = 4$; the
parameters were not tuned to accomplish optimal performance.
These values were obtained in an ad-hoc fashion,
and not by systematic optimization (which
is left for future research). Nevertheless, we noticed
that the matching results are stable, namely, the same
results are obtained under quite large perturbations of
the parameters’ values.

3.2  Silhouette  classification 

Finally,  we  show  an  application  of  our  matching  method 
to silhouette classification. Successful classification is
indirect evidence for the correct matching obtained between
a large number of image pairs (there were 4005
matching  assignments  involved  in  the  study  described 
below). 

The task is defined as follows: 90 images of 6 different
objects are given. The objects include toy models
of a cow, wolf, hippopotamus, two different cars, and a
child. Each object contributed 15 images, taken from
different points of view (in a sector range of 40° azimuth
and 20° elevation). Using automatic image segmentation
techniques, the outlines of the objects were extracted
from  a  black  background,  and  automatically  matched  to 
each  other.  Based  on  the  feature  matching,  a  dissimilar¬ 
ity  measure  was  computed  for  every  pair  of  silhouettes. 
These  distances  served  as  input  to  an  automatic  hier¬ 
archical  clustering  algorithm  [2],  in  order  to  divide  the 
contours  into  subsets,  where  contours  in  the  same  subset 
are  more  similar  to  each  other  than  contours  in  different 
subsets. 

More  specifically,  we  computed  the  90  x  90  dissimi¬ 
larity  matrix  for  our  database  of  90  images.  Since  the 
matrix  is  symmetric  and  the  diagonal  elements  vanish, 
we had to match 4005 image pairs, a task which took
6.5 hours on an INDY R4400 175 MHz workstation (5.8
seconds per match, on average).
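The pair count follows from the symmetry of the matrix: only the entries above the diagonal need to be computed. A minimal sketch, in which the function name `dissimilarity_matrix` and the stand-in `dissim` callable are hypothetical:

```python
from itertools import combinations

def dissimilarity_matrix(items, dissim):
    """Fill a symmetric matrix with zero diagonal, calling dissim once per unordered pair."""
    n = len(items)
    D = [[0.0] * n for _ in range(n)]
    calls = 0
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = dissim(items[i], items[j])
        calls += 1
    return D, calls

# Stand-in data: 90 placeholder items and a toy dissimilarity.
D, calls = dissimilarity_matrix(list(range(90)), lambda a, b: abs(a - b))
seconds_per_match = 6.5 * 3600 / calls
```

For 90 images this gives 90 · 89 / 2 = 4005 calls, and 6.5 hours spread over them is about 5.8 seconds each, in agreement with the figures quoted above.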

The  distance  matrix  was  fed  into  a  clustering  algo¬ 
rithm  [2],  which  produced  the  dendrogram  of  figure  5.  In 
the  final  level  of  classification  (the  lowest  level  in  the  hi¬ 
erarchy),  the  90  images  were  grouped  precisely  according 
to  their  identity.  But  even  more  interesting  is  the  hierar¬ 
chical  structure  which  emerged.  The  scale  parameter  of 
the  clustering  algorithm  (temperature)  defines  the  level  of 
specification,  and  it  reflects  our  intuition  regarding  fam¬ 
ilies  of  objects.  This  structure  wouldn’t  have  emerged 
if  the  estimation  of  the  distance  between  weakly  similar 
shapes was not reliable. Therefore, capturing the true
hierarchical structure stands as indirect evidence for
the quality of our matching algorithm.

The clustering algorithm that we used does not require
the embedding of our data points (images) in some
normed space. The only information which is needed is
pairwise distances. This is in contrast with most other
clustering algorithms, including various kinds of iterative
algorithms, that perform data partitioning through
estimation of means (e.g., using deterministic annealing).
The use of means involves two main assumptions: that
the data points are vectors in some space, and that their
distribution is central. The algorithm that we have used
avoids these two assumptions.

On  the  other  hand,  for  visualization  purposes  it  might 
be  useful  to  obtain  a  low  dimensional  representation  of 
the  images  as  points  in  a  vector  space.  Such  a  repre¬ 
sentation  is  not  always  possible,  and  can  be  misleading. 
One popular way to obtain a low dimensional vector
representation of proximity data is by the method known
as Multi Dimensional Scaling, which is a non linear
optimization method. Here we adopt another approach to
facilitate visualization: to each image $i$ we assign a vector,
whose $k$-th component is the distance between image
$i$ and image $k$. This defines an embedding of $N$ images
in $\mathbb{R}^N$.

Next, we lower the dimension of the representation using
Principal Component Analysis. For the set of 90
images described above, the results of this manipulation
are shown in figure 6. For illustration purposes, the
grouping of points was done manually, and the process
was repeated for the two groups of animals and cars.
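The embedding and projection can be sketched without any numerical library: each row of the distance matrix is taken as the coordinate vector of one image, and the leading principal direction is found by power iteration on the centred rows. The function name `first_principal_coords`, the iteration count and the starting vector are assumptions made for this sketch; a full PCA would extract further components as well.

```python
import math

def first_principal_coords(D, iters=200):
    """Project the rows of distance matrix D onto their first principal axis."""
    n, d = len(D), len(D[0])
    mean = [sum(row[k] for row in D) / n for k in range(d)]
    X = [[row[k] - mean[k] for k in range(d)] for row in D]  # centred rows
    # Non-uniform start so it is unlikely to be orthogonal to the leading axis.
    v = [float(k + 1) for k in range(d)]
    norm = math.sqrt(sum(t * t for t in v))
    v = [t / norm for t in v]
    for _ in range(iters):
        proj = [sum(x[k] * v[k] for k in range(d)) for x in X]               # X v
        w = [sum(proj[i] * X[i][k] for i in range(n)) for k in range(d)]     # X^T X v
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0.0:
            break
        v = [t / norm for t in w]
    return [sum(x[k] * v[k] for k in range(d)) for x in X]

# Toy data: four points on a line forming two well-separated pairs.
pts = [0.0, 1.0, 10.0, 11.0]
D = [[abs(p - q) for q in pts] for p in pts]
coords = first_principal_coords(D)
```

In the toy example the two pairs land on opposite sides of the origin along the first axis, which is the kind of separation the manual grouping in figure 6 relies on. The overall sign of the axis is arbitrary.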


Figure  5:  A  classification  tree  (dendrogram)  obtained  by  a 
hierarchical  clustering  algorithm,  using  the  pairwise  distances 
between  silhouettes.  Two  arbitrary  representatives  are  shown 
for  each  class.  The  automatic  classification  is  100%  correct, 
and  the  hierarchy  reflects  the  true  structure  in  the  database. 


Figure 6: Visualization of the mutual similarities between the
90 images. At the highest level, groups 1, 2, 3 represent the images
of animals, cars and child respectively. The groups which
contain more than one object can be projected again onto their
own principal directions. The grouping was done manually, for
illustration only.

Abstract

This  paper  describes  a  method  for  recognizing  par¬ 
tially  occluded  objects  under  different  levels  of  illumi¬ 
nation  brightness  by  using  the  eigen-space  analysis. 
In  our  previous  work,  we  have  developed  the  “eigen- 
window”  method  to  recognize  the  partially  occluded  ob¬ 
jects,  and  have  demonstrated  that  the  method  works 
successfully  for  multiple  objects  with  specularity  under 
constant  illumination. 

In  this  paper,  we  modify  the  eigen-window  method 
for  recognizing  objects  under  different  illumination 
conditions  by  using  additional  color  information.  In 
the  proposed  method,  a  measured  color  in  the  RGB 
color  space  is  transformed  into  the  HSV  color  space. 
Then, the hue of the measured color, which is invariant
to changes in illumination brightness and direction,
is used for recognizing multiple objects under different
illumination conditions.

The  proposed  method  was  applied  to  real  images 
of  multiple  objects  under  different  illumination  con¬ 
ditions,  and  the  objects  were  recognized  and  localized 
successfully. 

1  Introduction 

Object  recognition  has  a  wide  variety  of  military 
and  civilian  applications.  Some  of  the  representa¬ 
tive  applications  include  bin-picking,  automatic  target 
recognition, and surveillance and monitoring. Earlier
work in this domain includes [1] - [2].
Despite  the  long  history  of  research,  these  applica¬ 
tions  still  provide  a  challenge  to  computer  vision  re¬ 
searchers.  The  main  difficulties  include  requirement 
for  real-time  processing,  difficulty  in  segmentation, 
and  difficulty  in  obtaining  appropriate  object  models. 

Recently,  visual  learning  methods  based  on  the 
eigen-space  analysis  have  shown  a  potential  to  solve 
some  of  these  difficulties  [3]  and  [9].  In  the  eigen- 
space  analysis,  object  models  are  learned  from  a  series 
of  images  taken  in  the  same  environment  as  in  the 
recognition  mode.  Thus,  the  difficulty  of  object  mod¬ 
eling  is  avoided  in  the  analysis.  Furthermore,  since  an 
object  model  is  stored  as  a  vector  in  a  low  dimensional 
eigen-space,  and  since  objects  are  recognized  by  com¬ 
paring  the  model  with  image  vectors,  computation  for 
object  recognition  can  remain  effectively  low  enough 
to  achieve  real-time  performance. 

Although promising, current eigen-space analysis assumes
that objects are not occluded in images. Therefore, to apply
eigen-space analysis to partially occluded objects, we proposed
dividing appearances into small windows, referred to as
eigen-windows [13], and applying eigen-space analysis
to each eigen-window. The basic idea is that, even if
some of the windows are occluded, enough windows remain
visible for the object to be recognized and localized
in the images.

Figure 1: Experimental Setup

One  drawback  of  the  eigen-window  method  is  that 
only  a  limited  number  of  images  can  be  used  for  learn¬ 
ing  object  models,  and  therefore,  all  possible  illumina¬ 
tion  directions  cannot  be  taken  into  account.  There¬ 
fore,  the  object  may  be  illuminated  from  a  different 
direction  at  the  recognition  mode,  resulting  in  incor¬ 
rect  recognition  results. 

In  this  paper,  to  overcome  that  drawback,  we  pro¬ 
pose  to  use  the  color  measurement  hue,  which  is  illu¬ 
mination  invariant,  in  the  eigen-window  method.  To 
demonstrate  the  effectiveness  of  the  proposed  method, 
we  applied  the  method  to  real  images  taken  under  dif¬ 
ferent  illumination  directions  and  brightness. 

2  Eigen-Window  Method 

In this section, we briefly review the eigen-window
method that we proposed in [13] to overcome limitations
of the original eigen-space analysis with respect to
image shift, occlusion, noise, and scaling.

2.1  Eigen-Space  Technique 

Let $M$ be the number of images $z_1, z_2, \ldots, z_M$
in a training set, each related to a rotation of the viewpoint
angles $\theta_1$ and $\theta_2$, as shown in Figure 1. Each image $z_i$, with
dimensions $N \times N$, has been converted into a column
vector of length $N^2$.

Figure 2: Eigen-Window Technique.

By subtracting the average brightness $c$ of all the
images, we obtain the training matrix of size $N^2$ by $M$,

$Z = [z_1 - c, \; z_2 - c, \; \ldots, \; z_M - c]$. (1)

The covariance matrix $Q = Z Z^T$ provides a series
of eigenvalues $\lambda_i$ and eigenvectors $e_i$ ($i = 1, \ldots, N^2$),
where each corresponding eigenvalue and eigenvector
pair satisfies:

$\lambda_i e_i = Q e_i$. (2)

To reduce the memory requirement, we ignore the eigenvectors
$e_i$ ($i > l$) that correspond to small eigenvalues.
These eigenvectors do not affect object recognition
results significantly. Once we obtain the remaining
eigenvectors, we can construct the eigenvector matrix
$E = [e_1, e_2, \ldots, e_l]$, which projects an image $z_i$
(dimension $N^2$) into the eigen-space as an eigen-point $g_i$
(dimension $l$):

$g_i = E^T (z_i - c)$. (3)

The eigen-space analysis can drastically reduce the
dimension of the images ($N^2$) to the eigen-space dimension
($l$) while preserving enough dominant features to
reconstruct the original images.
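
A minimal sketch of Equations (1) to (3) is given below. It assumes that $c$ is a single scalar (the mean brightness over all training pixels) and obtains the eigenvectors of $Q = Z Z^T$ from a singular value decomposition of $Z$ instead of forming $Q$ explicitly; that is a common numerical shortcut and not something stated in the text. The names `build_eigenspace` and `project` and the random test images are hypothetical.

```python
import numpy as np

def build_eigenspace(images, l):
    """images: array of shape (M, N, N); l: number of eigenvectors to keep."""
    M = images.shape[0]
    X = images.reshape(M, -1).astype(float).T     # columns are the vectors z_i
    c = X.mean()                                  # scalar average brightness (assumption)
    Z = X - c                                     # training matrix, Eq. (1)
    # Left singular vectors of Z are eigenvectors of Q = Z Z^T, and the
    # squared singular values are the corresponding eigenvalues, Eq. (2).
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :l], c, s[:l] ** 2

def project(E, c, image):
    """Eq. (3): map one N x N image to an l-dimensional eigen-point."""
    return E.T @ (image.reshape(-1).astype(float) - c)

rng = np.random.default_rng(0)
train = rng.random((12, 8, 8))                    # 12 synthetic 8 x 8 images
E, c, lam = build_eigenspace(train, l=5)
g = project(E, c, train[0])                       # eigen-point of the first image
```

Keeping only the first `l` columns corresponds to discarding the eigenvectors with small eigenvalues, since the singular values are returned in descending order.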

2.2 Eigen-Window Technique

To reduce disturbance effects such as image
shift and occlusion, we propose to select small windows
in the original images. Each of the selected small windows
is then analyzed by using the eigen-space analysis
described in the previous section. We call this
method the eigen-window method. Figure 2 shows an
overview of the method.

2.2.1  Training  Eigen-Windows 

The training set of eigen-windows is given as:

$F = [F^1, F^2, \ldots, F^M]$, (4)

where $F^i$ denotes the collection of eigen-windows
from the $i$th training image. Each $F^i$ has the form
$[f^i_1 - c, \; f^i_2 - c, \; \ldots, \; f^i_{n_i} - c]$, where $f^i_j$ denotes the $j$th
eigen-window in the $i$th training image, $n_i$ denotes
the number of eigen-windows in the $i$th image, and $c$
is the average intensity value across all eigen-windows
in the whole training set. In Figure 2, the white square
denotes one of the training eigen-windows.


2.2.2  Matching  Operation 

From an input image, a set of input eigen-window images
is obtained:

$G = [g_1 - c, \; g_2 - c, \; \ldots, \; g_n - c]$, (5)

such as the white window in the lower left image in
Figure 2.

The similarity between a training eigen-window and
an input eigen-window is evaluated by using the distance
between them in the eigen-space. Given an input
eigen-point $\psi_k$ projected from the input eigen-window
$g_k$ using equation (3), we try to find a training
eigen-point $\phi_k$ among all training eigen-points $\phi$ projected
from all training eigen-windows $f$. The training
eigen-point is the one with the maximum similarity, i.e.,
the minimum distance, defined as:

$\phi_k = \arg\min_{\phi} \| \psi_k - \phi \|$, (6)

where $\|x\|$ denotes the norm of $x$ using the L1-norm or
L2-norm. We denote the eigen-window that is projected
to $\phi_k$ as $f_k$. The eigen-window $f_k$ corresponds to the
input eigen-window $g_k$.
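
The matching rule of Equation (6) amounts to a nearest-neighbor search in the eigen-space. The sketch below uses a brute-force search, which is an assumption made for brevity; the text does not say how the search is organized. The function name `match_eigenpoints` and the small arrays are hypothetical.

```python
import numpy as np

def match_eigenpoints(psi, phi, norm_ord=2):
    """psi: (n, l) input eigen-points; phi: (m, l) training eigen-points.
    Returns, for every psi_k, the index of the closest phi and that distance."""
    diff = psi[:, None, :] - phi[None, :, :]              # shape (n, m, l)
    dist = np.linalg.norm(diff, ord=norm_ord, axis=2)     # norm_ord=1 for L1, 2 for L2
    idx = dist.argmin(axis=1)                             # Eq. (6)
    return idx, dist[np.arange(len(psi)), idx]

phi = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])      # training eigen-points
psi = np.array([[0.9, 0.1], [0.1, 1.8]])                  # input eigen-points
idx, d = match_eigenpoints(psi, phi)
```

For large training sets a spatial index would normally replace the dense distance matrix, but the result of the search is the same.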

2.2.3  Voting  Operation 

The previous matching operation selects a set of training
eigen-windows, $[f_1, f_2, \ldots, f_n]$, corresponding to the
input eigen-windows. We now sort the selected training
eigen-windows into groups, each of which contains the
windows that come from the same training image:

,  (7) 

where 

$F^i = \{ f \mid f \text{ comes from training image } i \}$. (8)

We then prepare a pose space for voting from the
established correspondences. In this operation, we consider
only translation, and therefore the space is two-dimensional,
corresponding to the $[x, y]$ location of eigen-windows.
The size of the pose space is set to twice
the input image size, e.g., 256 x 256 in our
examples. The pose space is prepared separately for
each group $F^i$.

Using each correspondence, we can compute the difference
between the training eigen-window's location $X(f_k)$
and the input eigen-window's location $X(g_k)$. ($X(f)$
represents the $[x, y]$ location of the eigen-window $f$.) The
difference is given as $X(g_k) - X(f_k)$.

Then, in the pose space, the cell that represents
this difference, $X(g_k) - X(f_k)$, gets a vote. To avoid
digitization error, all of the 5 x 5 neighbor cells around
the center cell get a vote from a single correspondence.
We repeat this operation using all correspondences
for a group $F^i$ (all the correspondences from the same
training image). Then, we obtain a resulting pose
space for each of the groups $F^i$.

Some  small  peaks  in  the  pose-space  are  due  to  noise; 
other  prominent  peaks  are  due  to  actual  objects  in 
an  input  image.  By  thresholding  these  peaks,  we 
can  eliminate  noise  peaks  and  extract  only  prominent 
peaks. 
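
The voting and thresholding steps can be sketched as below for one group of correspondences. The 128-pixel image size (giving the 256 x 256 pose space mentioned in the text), the function names, the example coordinates, and the vote threshold are all assumptions chosen for illustration.

```python
import numpy as np

def vote_pose_space(train_xy, input_xy, image_size=128, spread=2):
    """Accumulate translation votes X(g_k) - X(f_k) for one training image.
    train_xy, input_xy: (n, 2) integer [x, y] locations of matched windows."""
    size = 2 * image_size                          # pose space is twice the image size
    pose = np.zeros((size, size), dtype=int)
    offsets = np.asarray(input_xy) - np.asarray(train_xy)
    for dx, dy in offsets:
        cx, cy = int(dx) + image_size, int(dy) + image_size   # make indices non-negative
        x0, x1 = max(cx - spread, 0), min(cx + spread + 1, size)
        y0, y1 = max(cy - spread, 0), min(cy + spread + 1, size)
        pose[y0:y1, x0:x1] += 1                    # 5 x 5 neighbourhood when spread = 2
    return pose

def prominent_peaks(pose, threshold):
    """Offsets (dx, dy) of all cells whose vote count reaches the threshold."""
    half = pose.shape[0] // 2
    ys, xs = np.nonzero(pose >= threshold)
    return [(int(x) - half, int(y) - half) for x, y in zip(xs, ys)]

train_xy = np.array([[10, 10], [20, 15], [30, 40], [5, 5]])
# Three consistent correspondences near offset (7, -3) and one outlier.
input_xy = train_xy + np.array([[7, -3], [8, -3], [7, -2], [45, 55]])
pose = vote_pose_space(train_xy, input_xy)
peaks = prominent_peaks(pose, threshold=3)
```

Because each correspondence votes for a 5 x 5 block, slightly different offsets still reinforce one another, while the isolated outlier never exceeds a single vote and is removed by the threshold.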

2.2.4  Pose  Determination 

The number of prominent peaks in the pose space
is equal to the number of objects that have roughly
the same rotation but different translations.

By retrieving voted pairs, we further divide the
group $F^i$ into sub-groups, each of which belongs to
one prominent peak, i.e., an isolated object in the
input image.

Since the training set is sampled along the rotation
dimension, a small residual object rotation remains
due to the finite sampling interval. To obtain
the rotation and the translation precisely, we refine the
pose estimate via a least-squares minimization, using
the pairs in each sub-group:

$X(f_k) = R X(g_k) + T$, (9)

where $R$ and $T$ denote the small rotation and translation,
respectively.
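
One way to carry out the least-squares fit of Equation (9) is the closed-form SVD solution for a rigid 2-D transform, sketched below. The text does not say which minimization procedure was used, so this particular solver, the function name `fit_rotation_translation`, and the synthetic point pairs are assumptions.

```python
import numpy as np

def fit_rotation_translation(g_xy, f_xy):
    """Least-squares rotation R (2 x 2) and translation T with f = R g + T, Eq. (9)."""
    g = np.asarray(g_xy, dtype=float)
    f = np.asarray(f_xy, dtype=float)
    g_mean, f_mean = g.mean(axis=0), f.mean(axis=0)
    H = (g - g_mean).T @ (f - f_mean)              # 2 x 2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against a reflection
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    T = f_mean - R @ g_mean
    return R, T

# Synthetic sub-group: input window locations and their rotated, shifted matches.
rng = np.random.default_rng(1)
g_pts = rng.uniform(0, 100, size=(6, 2))
theta = np.deg2rad(3.0)                            # small residual rotation
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
T_true = np.array([4.0, -2.5])
f_pts = g_pts @ R_true.T + T_true
R_est, T_est = fit_rotation_translation(g_pts, f_pts)
```

With noise-free pairs the recovered rotation and translation equal the true ones up to rounding; with real correspondences the same call returns the least-squares estimate.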

2.3 Selection of Effective Eigen-Windows

In the eigen-window method, it is very important,
and also difficult, to select an optimal set of training
eigen-windows. If all of the initially selected windows are
used as eigen-windows, two problems occur: 1) the
number of eigen-windows becomes very large and storing
them requires a large amount of memory, and 2)
due to the similarity among eigen-windows, the matching
process becomes error-prone.

In this section, we introduce three criteria for selecting
the optimal set of eigen-windows: detectability,
uniqueness, and reliability.

The  detectability  measures  the  ease  of  detecting  a 
window  in  an  entire  image.  For  example,  a  window 
containing  corners  of  an  object  is  much  easier  to  detect 
than  one  containing  a  planar  region. 

Although some windows are easy to detect, they
may be similar to each other. This situation happens
when the target object has multiple similar corners.
To select truly distinct windows, we introduce a global
goodness measure called the uniqueness measure.

In  addition,  the  reliability  measure  is  used  to  se¬ 
lect  windows  which  do  not  appear  and  disappear  with 
small  variation  of  object  pose  such  as  orientation  and 
translation. 

Using these three measures, we can obtain the optimal
set of eigen-windows.

2.3.1  Detectability:  Local  Goodness 

For initial selection of eigen-windows in images, algorithms
that select feature points for object tracking can
be used. Tomasi et al. proposed the following
2 x 2 matrix as the trackability measure for a window
$R$ of pixels $\mathbf{x} = [x, y]^T$ [12]:

$G = \sum_{\mathbf{x} \in R} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$, (10)

where $I$ represents pixel intensities inside the window,
and $I_x$ and $I_y$ are its partial derivatives.

This matrix $G$ has two eigenvalues $\lambda_1$ and $\lambda_2$. The
window is accepted as a good one if

$\min(\lambda_1, \lambda_2) > \lambda$ (11)

holds, where $\lambda$ is a predefined threshold. The measure
works well for detecting most of the important corners.

Figure 3: A Problem in Corner Detection with Trackability.
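
The detectability test of Equations (10) and (11) can be illustrated with a few synthetic windows. The sketch assumes central-difference gradients (`numpy.gradient`); the function name `detectability` and the 9 x 9 test windows are hypothetical.

```python
import numpy as np

def detectability(window):
    """Smaller eigenvalue of the 2 x 2 matrix G of Eq. (10) for one window."""
    Iy, Ix = np.gradient(window.astype(float))     # derivatives along rows (y) and columns (x)
    G = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    return np.linalg.eigvalsh(G)[0]                # eigvalsh returns ascending eigenvalues

flat = np.ones((9, 9))                             # planar region
edge = np.zeros((9, 9)); edge[:, 5:] = 1.0         # vertical step edge
corner = np.zeros((9, 9)); corner[5:, 5:] = 1.0    # corner

scores = {name: detectability(w)
          for name, w in (("flat", flat), ("edge", edge), ("corner", corner))}
```

Only the corner window yields two large eigenvalues; the flat region and the straight edge each leave at least one eigenvalue at zero, so a threshold on the smaller eigenvalue rejects them.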

2.3.2  Uniqueness:  Global  Goodness 

By using the detectability measure, we can select
windows which contain important features. Unfortunately,
however, the detectability measure does not
guarantee the global uniqueness of the selected windows.
In other words, some of the selected windows
may have corners very similar to corners in other windows
(Figure 3). These windows are not desirable for
recognition and localization of objects.

We define the global goodness of windows as the
uniqueness of an eigen-window. The uniqueness of
each window can be measured by computing the similarity
among eigen-windows. As discussed in Section 2.2.2,
the similarity between a training eigen-window and an
eigen-window in an input image is computed by using
the distance between them in the eigen-space.

We can use the same measure for evaluating the
global goodness of selected windows, i.e., the similarity
among training eigen-windows:

$S_{l,m} = \| \phi_l - \phi_m \| < T_{sim}$. (12)

The similarity $S_{l,m}$ between two training eigen-points
$\phi_l$ and $\phi_m$, which are projected from two
eigen-windows, is evaluated by using this equation. Because
$S_{l,m}$ is a distance, a small value means that the two
windows look alike. If the computed value is less than a
certain threshold $T_{sim}$, then the two eigen-windows are
discarded from the training set (Figure 4).

The elimination of similar eigen-windows effectively
reduces the size of the training set. It also makes the
matching process more robust, because the reduced set
does not contain similar eigen-windows, and therefore
the matching evaluation is less likely to be fooled by them.
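
The uniqueness filter of Equation (12) can be sketched as follows: every eigen-point that lies closer than $T_{sim}$ to some other eigen-point is dropped, so both members of a similar pair are removed, as the text describes. The function name `unique_indices`, the L2 distance, and the sample points are assumptions.

```python
import numpy as np

def unique_indices(phi, t_sim):
    """Indices of training eigen-points whose distance to every other
    eigen-point is at least t_sim (Eq. 12 applied to all pairs)."""
    phi = np.asarray(phi, dtype=float)
    dist = np.linalg.norm(phi[:, None, :] - phi[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # ignore the distance of a point to itself
    return np.nonzero(dist.min(axis=1) >= t_sim)[0]

# Points 0 and 1 are near-duplicates; points 2 and 3 are distinct.
phi_train = np.array([[0.0, 0.0], [0.05, 0.0], [3.0, 0.0], [0.0, 4.0]])
kept = unique_indices(phi_train, t_sim=0.5)
```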

Figure 5: Reliability Evaluation in Eigen-space


2.3.3  Reliability 

The  reliability  of  an  eigen-window  for  object  recogni¬ 
tion  can  be  inferred  by  measuring  how  much  the  eigen- 
point  which  corresponds  to  the  eigen-window  moves  in 
the  eigen-space  while  an  object  is  moved  slightly. 

For example, if training images are taken by rotating
the object at an angle step of 10 deg, we can evaluate
the reliability by rotating the object slightly, e.g.,
±5 deg, for each of the training images. If an eigen-point
$\phi_j$ remains close to its original location for the
training image, we consider the eigen-point to have
high reliability. The reliability is defined as:

$R_j = \sum_{\delta\theta \in \pm\Delta\theta} \| \phi_j^{\delta\theta} - \phi_j \| < T_{rel}$, (13)

where $T_{rel}$ is a threshold value for the reliability
(Figure 5).

Some  of  the  windows  may  disappear  and  reappear 
while  the  object  is  rotated.  In  this  case,  we  simply 
discard  those  feature  points. 

3  Illumination  Invariance 

In our previous work [13], we showed that the
eigen-window method can successfully recognize and
localize an object in black-and-white input images that contain
multiple objects with specularity, even if the input images
contain a significant amount of noise, occlusion, image
shift, and scale change.

However, the method was based on the assumption
that the location and brightness of the light source are
fixed. Therefore, the method did not take into account
shading variations such as highlights on object
surfaces. For instance, if an object exhibits specularity,
the object appearance can change drastically
with different illumination directions, which confuses
recognition and localization of the object.

To  overcome  this  limitation,  we  propose  to  use  an 
illumination  invariant  measure  for  the  eigen-window 
method.  By  using  the  illumination  invariant  measure, 
the  eigen-window  method  can  be  used  successfully  for 
recognizing  and  localizing  multiple  objects  under  dif¬ 
ferent  illumination  conditions. 

3.1  Illumination  Invariance:  Hue 

Instead of black-and-white intensity images, we
use RGB color images in the modified eigen-window
method. Several studies on color indexing have been
reported in the past [17] - [19], but we use the hue
criterion for its simplicity: a color image measured in
the RGB color space is converted to an HSV image
(H: Hue, S: Saturation, V: Value). Of these three parameters,
hue represents color information independently of
brightness. Therefore, the hue
is not affected by changes in the illumination brightness
and direction if the following two conditions hold: 1)
the light source color can be expected to be almost
white, and 2) the saturation value of the object color is
sufficiently large.

The original color of the object, $X$, is transformed to
$X' = s \cdot X + t \cdot I$ by the change in diffuse shading and
specularity, as shown in Figure 6. Here $s$ and $t$ represent
the relative strengths of the diffuse reflection component
and the specular reflection component of the color $X'$,
respectively. If the two conditions mentioned above
are true, then the hue of $X'$ remains the same as that
of $X$.

In Figure 6, object color is represented by three
color components $S_1$, $S_2$, and $S_3$. In the RGB color
space, those three color components are Red, Green,
and Blue. Then, the light source color $I$ is given as
$I = (1, 1, 1)$. To define hue, saturation, and intensity, one
pair from the three components Red, Green, and Blue
has to be assigned to $S_1'$ and $S_2'$. Usually, Red and
Green are assigned as $S_1' = R$ and $S_2' = G$.
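
The invariance claim can be checked numerically with a small sketch. It uses the hexcone HSV conversion of Python's standard `colorsys` module, which is an assumption: the hue in the text is defined as an angle in the plane of the two selected color components, and the two definitions are not identical, although both ignore a uniform scaling and a white offset. The color and the values of `s` and `t` are arbitrary.

```python
import colorsys

def shade(color, s, t):
    """X' = s * X + t * I for a white light source I = (1, 1, 1)."""
    return tuple(s * channel + t for channel in color)

X = (0.6, 0.3, 0.1)                      # object color in RGB (arbitrary example)
X_shaded = shade(X, s=0.5, t=0.2)        # dimmer diffuse term plus a white highlight term

h0, s0, v0 = colorsys.rgb_to_hsv(*X)
h1, s1, v1 = colorsys.rgb_to_hsv(*X_shaded)
hue_shift = abs(h0 - h1)                 # stays at zero up to rounding
```

Saturation and value both change under this transformation, while the hue does not, which is the property the modified eigen-window method relies on.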

We conducted a simple experiment using a color
test chart to see how hue is affected under different
levels of illumination brightness. The results are shown
in Figure 7 and Figure 8. In Figure 8 (b), we can see
that hue remains almost constant over a wide range of
illumination brightness for many color blocks.

However, for some color blocks, the value of
hue does change with different levels of illumination
brightness. Examples are the black-to-white color
blocks in the last row of the color chart (color blocks
#30-#35), red (color block #12), and magenta (color
block #24).

That is because the saturation of color blocks #30-#35
is not sufficiently large, i.e., they are very close to
gray. Also, hue has a discontinuity at 0 and $2\pi$, which
is the reason for the unstable hue of color blocks #12
and #24.

To  obtain  the  value  of  hue  reliably,  we  propose 
to  use  three  criteria:  intensity  value,  saturation,  and 
phase. 

Figure 7: Illumination Constancy with a Color Test Pattern Image.


3.1.1  Intensity  Value 

To eliminate the background noise, we apply a threshold
value for the intensity value as

if $V < V_t$ then $H = 0$, (14)

where $V$, $V_t$, and $H$ are the intensity value, the threshold
value, and the hue value, respectively. If a measured
color is not bright enough, the color is discarded,
and the hue value is set to a predetermined value,
i.e., 0.

3.1.2  Saturation 

One of the problems shown in the example in Section 3.1
is that, if the object color is close to gray, then the hue
value of the color is not stable. The reason is that,
if the color is almost gray, the object color in the $S_1' S_2'$
plane lies around the point C in Figure 6. That
means the hue angle cannot be determined robustly
in the face of image noise. Therefore, a measured color
should be discarded if its saturation value is less than
a certain threshold $S_t$:

if $S < S_t$ then $H = 0$, (15)

where $S$ is the saturation value. Using this equation,
measured colors close to gray are discarded in the image.

3.1.3  Phase 

The other problem shown in the example in Section 3.1
is that a color close to red has a hue value near its
discontinuity. The range of the hue value is from 0 to $2\pi$,
and it has a discontinuity at 0 and $2\pi$. We avoid the
discontinuity effect by using the phase threshold value
$\Delta P_t$ as:

if $H < \Delta P_t$ or $\| H - 2\pi \| < \Delta P_t$ then $H = 0$. (16)

In the examples shown in Figure 7 and Figure 8, the
red color element may be neglected with this criterion.

It is important to note that the location of the hue
discontinuity depends on the selection of the color components
$S_1'$ and $S_2'$. In the next section, we discuss how to select
the color components $S_1'$ and $S_2'$ so that more windows
can be found.

Figure 8: Color Elements. (a) RGB space; (b) Hue-Intensity space
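
The three criteria of Equations (14) to (16) can be combined into one small function, sketched below. The threshold values are hypothetical, since the text gives none, and the hue is again taken from the hexcone HSV conversion in `colorsys` (discontinuity at red) as a stand-in for the component-plane hue of the paper.

```python
import colorsys
import math

def reliable_hue(r, g, b, v_t=0.1, s_t=0.2, dp_t=0.2):
    """Hue in [0, 2*pi), or 0 when the color fails one of the three criteria.
    v_t, s_t, dp_t are illustrative thresholds, not values from the paper."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    hue = 2.0 * math.pi * h
    if v < v_t:                                   # Eq. (14): too dark, treat as background
        return 0.0
    if s < s_t:                                   # Eq. (15): too close to gray
        return 0.0
    if hue < dp_t or abs(hue - 2.0 * math.pi) < dp_t:   # Eq. (16): near the discontinuity
        return 0.0
    return hue

dark = reliable_hue(0.02, 0.05, 0.03)
gray = reliable_hue(0.50, 0.52, 0.50)
red = reliable_hue(0.90, 0.10, 0.10)
green = reliable_hue(0.10, 0.80, 0.10)
```

Only the saturated green sample keeps its hue; the dark, gray, and red samples are all mapped to the background value 0, mirroring the way red is neglected in the color chart example.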

3.2 How to Select Color Components $S_1'$ and $S_2'$

Usually, the two color components $S_1'$ and $S_2'$ are
set to R and G. But if the R and G factors are used
for the two color components, the discontinuity of hue
appears around the color red, as described in the previous
section. Therefore, if red is the most important
component for recognizing the objects, the use of R
and G for $S_1'$ and $S_2'$ is not desirable.

In this section, we show how to choose $S_1'$ and
$S_2'$ from the RGB components so that we can select more
windows to be used as eigen-windows as described in
Section 2.

There are six combinations for the selection of $S_1'$
and $S_2'$ from the RGB components. Figure 9 shows the
result of window selection by using each combination
of $S_1'$ and $S_2'$. In the figure, RG represents that $S_1' = R$
and $S_2' = G$.

Figure 9: Image Invariance and Window Selection.
(a) Original Image; (b) Hue Image and Feature Points
with RG; (c) RB; (d) GR; (e) GB; (f) BR; (g) BG

Figure 10: Some Example Hue Images. (a) Original
Image of Mug; (b) Hue Image of Mug; (c),(d) Birds;
(e),(f) Tylenol

The windows in the hue images were selected by
using the corner detector algorithm described in
Section 2.3.1. So if a hue image does not have enough
contrast, fewer windows are selected. In this example,
the largest number of windows was selected for the
case of $S_1' = R$ and $S_2' = B$ in Figure 9. Intuitively,
that result indicates that there are not many green
color components in the example image.

Several  examples  of  hue  images  are  shown  in  Fig¬ 
ure  10.  In  those  examples,  we  can  see  that  the  hue 
value  remains  almost  constant  on  the  object  surface 
with  large  shading  change.  For  instance,  in  Figure  10 
(c)  and  (d),  the  hue  of  the  yellow  duck’s  surface  ap¬ 
pears  to  be  constant  even  though  the  color  image  of 
the  duck  has  a  wide  range  of  intensity.  In  those  ex¬ 
amples,  background,  gray  color,  and  green  color  have 
been  eliminated  according  to  the  equations  (14),  (15) 
and  (16),  respectively. 

4  Experimental  Results 

The proposed method was used for recognition and
localization of objects in three test cases. In the first
case, the same illumination condition was used both
for training and for input images. In the second case,
input images were taken under different levels of illumination
brightness. In the last case, input images
were taken with different light source locations.

Figure 11: Recognition Result


4.1 Object Recognition and Localization with Hue Image

First, a set of training eigen-windows was obtained
as described in Section 2. The training images were
taken at $\theta_1 = [-20, 0, 20]$ and $\theta_2 = [0, 10, 20, \ldots, 350]$
for three different objects: mug, bird, and tylenol. We
refer to the original images as type($\theta_1$, $\theta_2$). For example,
the image mug(-20, 60) denotes the image
of the mug taken at the position $\theta_1 = -20$ deg and
$\theta_2 = 60$ deg. One hundred eight images were taken for
each of the objects by using the experimental setup
shown in Figure 1.

Then, eigen-windows were selected in each training
image by using the detectability, uniqueness, and
reliability measures described in Section 2.3.
The number of eigen-windows for each of the objects
was initially more than 8,000. After the three measures
were applied, fewer than 2,000 training eigen-windows
remained. Then, these eigen-windows were projected
to produce eigen-points according to equation (3).

One input image containing multiple objects was
taken, as shown on the left-hand side of Figure 11. In the
input image, there are 7 objects: duck, mug, barney,
bird, stop-sign, tylenol, and tylenol-cold. First, eigen-windows
were selected in the input image by using
the detectability measure. Then, we established
correspondences between the input eigen-windows and
the training eigen-windows by using the similarity
between their eigen-points according to equation
(12).

The recognition and localization results are shown
in the middle column of Figure 11. The
figures in the right column show the resulting pose
spaces. The obtained affine parameters and the standard
deviations in the pose space are also shown. As we
can see, each object's type, pose, and location were
successfully obtained.

Figure 12: Object Recognition Results with Illumination Change

4.2 Effect of Illumination Brightness Change

The  same  training  eigen-window  set  was  applied  to 
input  images  taken  under  a  wide  range  of  illumination 
brightness.  Figure  12  shows  the  result. 

The  original  color  images  are  shown  in  the  left  col¬ 
umn,  and  the  computed  hue  images  are  shown  in  the 
middle  column.  The  localization  and  recognition  re¬ 
sults  are  shown  in  the  right  column.  The  affine  pa¬ 
rameters  and  standard  deviations  of  the  pose  space 
are  also  given  in  the  figure. 

The hue images did not change significantly with
different levels of illumination brightness. The main
difference between the hue image for the brightest
illumination and that for the darkest illumination is that,
in the latter, hue values were not computed over a large
portion of the object surfaces. This is because intensity values
were so small that hue values were set to zero, the
background value, according to equation (14).

The  experimental  results  show  that  the  proposed 
method  works  even  when  input  images  are  taken  with 
different  levels  of  illumination  brightness.  The  object 
was  recognized  and  localized  successfully. 

4.3 Effect of Different Light Source Positions

The proposed method was also applied to input images
taken with different light source locations. As
the light source position changes, the appearance of
objects in input images changes drastically. Therefore,
changing the light source position makes recognition
and localization of objects even harder than changing
the illumination brightness.

In  this  experiment,  four  different  light  source  posi¬ 
tions  were  used  as  shown  in  Figure  13.  The  left  column 
images  of  Figure  14  show  the  input  images  taken  with 
each  of  the  four  light  source  positions.  The  middle 
column  images  of  Figure  14  show  the  obtained  hue 
images.  The  right  column  images  present  the  recog¬ 
nition  and  localization  results.  The  affine  parameters 
and  standard  deviations  of  pose-space  are  also  shown 
in  the  figure. 

Note that, in this experiment, there was no ambient
illumination. Hence, the appearance of the objects
changes significantly with different light source positions.
Nevertheless, the mug was correctly recognized
and localized except in the input image for light
position 1. In that case, hue values were not obtained
over a large portion of the object surface because of
a shadow cast on the surface.

Figure 14: Effect of Light Source Direction

5  Conclusion 

In this paper, we described a novel method called
the eigen-window method. The method extends the
standard eigen-space analysis to the recognition of
partially occluded objects. To reduce the redundancy
among eigen-windows, we proposed three
measures for selecting eigen-windows effectively:
detectability, uniqueness, and reliability.

By  using  hue,  which  is  an  illumination  invariant 
measure,  the  eigen-window  method  was  extended  fur¬ 
ther  for  recognition  and  localization  of  objects  in  im¬ 
ages  taken  under  changing  illumination  conditions.  To 
use  hue  information  of  input  images  reliably,  we  intro¬ 
duced  three  criteria  for  computing  hue  values:  inten¬ 
sity  value,  saturation,  and  phase. 

The  proposed  method  was  applied  to  real  images, 
and  the  method  recognized  and  localized  objects  suc¬ 
cessfully  even  in  images  taken  under  significantly  dif¬ 
ferent  illumination  conditions. 

Abstract

Edge  detection  is  the  process  of  estimating  extended 
curve  discontinuities  in  the  image  intensity  surface. 
As  a  primary  module  in  image  understanding  systems, 
its  performance  is  critical.  This  paper  presents  a  new 
family  of  algorithms  for  detecting  edges  in  an  image. 
Edges  with  delta,  step  and  crease  cross  sections  along 
curves  are  found.  In  the  algorithm,  an  image  is  con¬ 
volved  at  each  pixel  at  four  orientations.  Then,  the 
best  least-squares  estimate  of  an  edge  is  made  over  a 
2x2x2 cube determined by adjacent x, y, and φ values.
Finally,  the  detected  edge  image  is  built  by  a  primi¬ 
tive  linking  process.  Both  theoretical  analyses  and 
simulation  experiments  have  been  done;  real  images 
have  been  analyzed.  Quantitative  results  indicate  that 
the  proposed  detector  has  superior  detection  and  lo¬ 
calization;  the  standard  deviations  of  orientation  and 
transverse  position  are  a  factor  of  four  better  than  the 
Wang-Binford  operator  at  the  same  signal  to  noise  ra¬ 
tio.  It  performs  well  even  when  the  ratio  is  less  than 
4. 

1  Introduction 

In  the  Successor  paradigm,  objects  are  interpreted 
globally  by  Bayesian  networks  that  express  probabil¬ 
ities  among  a  hierarchy  of  uncertain  geometric  hy¬ 
potheses.  The  3D  levels  in  Successor  include:  object 
(physical),  3D  volume,  3D  surface,  3D  curve,  and  3D 
point.  Quasi-invariants  and  invariants  correspond  3D 
surfaces  to  2D  image  areas,  3D  curves  to  2D  image 
curves,  and  3D  points  to  2D  image  points.  Successor 
requires  reasonable  measurement  of  extended  image 
edges  and  of  relations  among  extended  edges  in  order 
to  build  3D  interpretations. 

The  performance  of  the  edge  detector  affects  most 
modules  in  image  understanding  systems,  like  stereo, 
motion,  and  surface  estimation.  Until  recently,  the 
quality  of  measurement  of  extended  edges  extracted 
from  images  was  very  unsatisfactory  from  the  stand¬ 
point  of  vision  systems.  The  Wang-Binford  operator 
provided  reasonable  extended  edges.  This  algorithm 
provides  substantial  improvements  in  the  quality  of 
measurement  of  extended  edges. 

An intensity image $I(x, y)$ is a mathematical surface. Edges in an image correspond to intensity discontinuities in the image surface. Low-order discontinuities are interesting, i.e. discontinuities along curves with cross sections in the form of delta, step, and crease, as shown in Fig. 1 [1].

Figure 1: The transverse profiles of the three main types of edges.

In the past, a variety of segmentation algorithms have been developed, like the Binford-Horn operator [2], the Marr-Hildreth operator [3], the Canny operator [4], and the Nalwa-Binford operator [5]. However, the results are not satisfactory, primarily because they do not produce extended edges of sufficient quality, which restricts their applications. Since edge linking to form extended edges is an exponential branching process, accurate edgel estimation is required to cut the branching to tolerable limits. With the Wang-Binford operator [6] [7], high quality extended edges have been obtained. Now, the operator described here performs substantially better.

There  are  three  principal  problems  in  previous  edge 
detection  algorithms,  excluding  the  Wang-Binford  op¬ 
erator.  A  main  problem  is  that  the  shading  effect  was 
not  taken  into  account;  that  is,  the  intensity  of  an  im¬ 
age  along  an  edge  was  modeled  as  a  constant,  which 
is  not  true  in  the  real  scene.  The  second  problem  is 
that  the  estimates  of  orientation  and  position  were 
not  accurate.  The  inaccuracy  raised  severe  problems 
in  linking  pixels  into  extended  edges,  and  then  in  the 
follow-up  processes.  The  third  problem  is  that  only 
step  edges  were  covered  in  those  algorithms.  There 
are  three  main  types  of  edges:  delta  edges,  step  edges, 
and  crease  edges.  A  significant  amount  of  information 
in  an  image  is  lost  if  only  step  edges  can  be  detected. 
A new edge detector which overcomes all these problems is presented here. It provides least-squares estimates of the position, orientation, and amplitude of
an  edge,  which  are  no  longer  sensitive  to  shading.  In 
addition,  it  estimates  three  kinds  of  edges  by  using  a 
single  family  of  algorithms.  Though  the  Wang-Binford 
operator  can  also  estimate  all  three  edge  types  reason¬ 
ably  well,  the  performance  of  this  detector  improves 
dramatically  over  Wang-Binford  from  both  practical 
and  statistical  standpoints.  Based  on  accurate  esti¬ 
mates  and  the  statistical  analyses  of  edgels,  isolated 
edgels  are  aggregated  into  extended  edges.  The  link¬ 
ing  process  is  still  under  development,  but  the  current 
results  look  promising.  In  this  paper,  a  generic  edge 
model  is  introduced  in  Section  2,  then  the  edge  detec¬ 
tor  is  developed  in  Section  3.  The  statistical  analysis 
is  discussed  in  Section  4.  Real  images  are  tested  in 
Section  5.  The  conclusions  are  given  in  Section  6. 

2  Edge  Model 
2.1  Ideal  Model 

Edges  are  defined  as  discontinuities  of  the  image  in¬ 
tensity.  According  to  the  order  of  discontinuity,  edges 
can  be  divided  into  three  groups  (see  Fig.  1):  delta 
and  step  edges  with  zeroth  order  discontinuity,  and 
crease  edges  with  first  order  discontinuity. 

An extended edge in an image can be approximated locally by a tangent straight line element called an edgel. Near an edgel, the image surface can be expressed as $I_0(l, t)$, where $l$ and $t$ are the longitudinal and transverse directions of the edgel respectively. In general, $I_0(l, t)$ can be decomposed into two profile factors, $I_L(l)$ and $I_T(t)$:

\[ I_0(l, t) = I_L(l)\, I_T(t) \]

where $I_L(l)$ is the image profile in the longitudinal direction and $I_T(t)$ is the image profile in the transverse direction.

In the longitudinal direction, the shading effect is considered, i.e.

\[ I_L(l) = h + s_l\, l \]

where $h$ is the amplitude of the edgel, and $s_l$ is the shading coefficient. Typically, the gray level of $|h|$ is larger than 2 and $|s_l|$ is much smaller than 0.1. The shading effect is included only in $I_L(l)$, not in $I_T(t)$, because the edgel profile dominates the intensity of an image in the transverse direction.

Different functions are used to model different types of edgels in the transverse direction. For delta edgels, $I_T(t)$ can be described by a delta function; for step edgels, $I_T(t)$ can be described by a step function; for crease edgels, $I_T(t)$ can be described by a triangle function. The three edgels are closely related: the derivative of a triangle function is a step function, and the derivative of a step function is a delta function. Hence, step edgels can be determined by convolving the image with the first derivative of the mask.


Similarly,  crease  edgels  can  be  determined  by  convolv¬ 
ing  the  image  with  the  second  derivative  of  the  mask. 
Most  of  the  discussions  in  this  paper  will  focus  on  de¬ 
tecting  the  delta  edgel,  but  they  can  be  applied  to  the 
other  two  types  of  edgels  as  well. 

2.2  Real  Model 

A  real  image  is  blurred  by  the  impulse  response  of 
the  optical  system,  and  is  perturbed  by  measurement 
noise.  Blur  can  be  represented  by  convolving  the  orig¬ 
inal  image  with  a  2-D  blur  function.  In  many  cases, 
the  blur  function  can  be  approximated  by  an  isotropic 
2-D  Gaussian: 

\[ G(l, t) = \frac{1}{2\pi\sigma_b^2} \exp\!\left(-\frac{l^2 + t^2}{2\sigma_b^2}\right) \]


Noise  can  be  modeled  by  additive,  zero-mean,  white 
Gaussian  noise  for  which  the  probability  density  func¬ 
tion  fN{n)  is 


\[ f_N(n) = \frac{1}{\sqrt{2\pi}\,\sigma_N} \exp\!\left(-\frac{n^2}{2\sigma_N^2}\right) \]


Combining  these  two  effects,  the  model  of  the  delta 
edgel  can  be  modified  as 

\[ I(l, t) = [(h + s_l\, l)\,\delta(t)] \otimes G(l, t) + N(l, t) \]

\[ \approx (h + s_l\, l)\, \exp\!\left(-\frac{t^2}{2\sigma_b^2}\right) + N(l, t) \]

where $\otimes$ denotes 2-D convolution.

This  is  the  generic  edge  model  which  will  be  used 
throughout  the  rest  of  the  paper. 
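
As an illustration of the generic edge model, the following Python sketch synthesizes a small image containing one blurred, noisy delta edgel. It is only a sketch under stated assumptions: the function name `ridge_image`, its parameter names, and the blur-width symbol `sigma_b` are introduced here for the example, and the normalization constant of the blur is absorbed into the amplitude, as in the approximate form of the model.

```python
import math
import random

def ridge_image(size, h, s_l, alpha, sigma_b, sigma_n, seed=0):
    """Synthesize a blurred, noisy delta edgel (ridge) through the image centre.

    h       : amplitude of the edgel
    s_l     : shading coefficient along the edgel
    alpha   : ridge angle in radians, measured from the x axis
    sigma_b : width of the Gaussian blur (symbol assumed for this sketch)
    sigma_n : standard deviation of the additive white Gaussian noise
    """
    rng = random.Random(seed)
    c = (size - 1) / 2.0
    ca, sa = math.cos(alpha), math.sin(alpha)
    img = []
    for y in range(size):
        row = []
        for x in range(size):
            dx, dy = x - c, y - c
            l = ca * dx + sa * dy       # longitudinal coordinate
            t = -sa * dx + ca * dy      # transverse coordinate
            clean = (h + s_l * l) * math.exp(-t * t / (2.0 * sigma_b ** 2))
            row.append(clean + rng.gauss(0.0, sigma_n))
        img.append(row)
    return img
```

Calling `ridge_image(9, 10.0, 0.05, 0.0, 1.0, 0.0)` gives a noise-free horizontal ridge whose intensity grows slowly along the ridge (the shading term) and falls off as a Gaussian across it.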

3  Edgel  Detection 

Based  on  the  model  developed  in  Section  2,  a  new 
edge  detector  will  be  presented  in  this  section.  The 
main function of this detector is to find the position $(x, y)$, orientation $\alpha$, and amplitude $h$ of all edgels in an image. The parameters $(x, y, \alpha, h)$ will be estimated from the moments of an image, which are the convolutions of the image with a set of masks. Once these parameters are solved correctly, the edgels are detected.

3.1  Estimating  the  Moments 

To  detect  the  position,  orientation,  and  amplitude  of  a 
delta  edgel  (also  called  a  ridge),  a  mask  M  is  designed 
as 


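\[ M(u, v) = G_u(u)\, G_v'(v) = \frac{1}{\sqrt{2\pi}\,\sigma_u} \exp\!\left(-\frac{u^2}{2\sigma_u^2}\right) \cdot \left(-\frac{v}{\sigma_v^2}\right) \frac{1}{\sqrt{2\pi}\,\sigma_v} \exp\!\left(-\frac{v^2}{2\sigma_v^2}\right) \]

where $(u, v)$ are the local coordinates of the mask. The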
profile  of  M  along  the  v  direction  is  the  first  deriva¬ 
tive  of  a  Gaussian,  which  will  smooth  the  input  image 
by  convolving  it  with  a  Gaussian,  and  then  estimate 
the  directional  derivative  of  the  smoothed  image.  The 
profile  of  M  along  the  u  direction  is  another  Gaussian, 
which  will  average  the  estimations  of  the  derivatives. 
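
The mask can be sampled directly from its definition. The Python sketch below evaluates $M(u, v) = G_u(u)\,G_v'(v)$ and approximates the integral of $M^2$ by a Riemann sum; for unit-variance Gaussians this integral has the closed form $1/(8\pi\sigma_u\sigma_v^3)$, which is the factor that scales the noise variance of a moment. The function names, the grid step, and the integration extent are assumptions chosen for the example (the default extent suits $\sigma_u = \sigma_v = 1$).

```python
import math

def mask_value(u, v, sigma_u=1.0, sigma_v=1.0):
    """M(u, v): Gaussian along u times the first derivative of a Gaussian along v."""
    g_u = math.exp(-u * u / (2 * sigma_u ** 2)) / (math.sqrt(2 * math.pi) * sigma_u)
    dg_v = (-v / (math.sqrt(2 * math.pi) * sigma_v ** 3)
            * math.exp(-v * v / (2 * sigma_v ** 2)))
    return g_u * dg_v

def mask_energy(sigma_u=1.0, sigma_v=1.0, step=0.05, extent=6.0):
    """Riemann-sum approximation of the integral of M(u, v)^2 over the plane."""
    n = int(round(extent / step))
    total = 0.0
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            total += mask_value(i * step, j * step, sigma_u, sigma_v) ** 2
    return total * step * step
```

The mask is odd in $v$ and even in $u$, so it responds to the transverse derivative of the smoothed image while averaging along the edgel.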

Let  moment  k  be  the  convolution  of  the  image  with 
the  mask  M .  Near  a  ridge,  k  can  be  described  as 

k{lm,tm,0)  (1) 

=  (2) 

=  Cexp(-^)(6o  -t-  him  +  htmlm  +  Mm)(3) 


6o  =  Si  sin  6 
bi  =  h  cos  6 

62  =  Si  cos  6 

'-2  ^2 

63  =  Si  — — sin  6  cos^  0 

du  =  0-1  +  al 
al  -  <tI  +  <tI 
al  =  al  sin^  0  +  al  cos^  0 

where $l_m$, $t_m$ and $\theta$ are the position and the orientation of the mask with respect to the ridge.

So far, three coordinate systems have been introduced: the image coordinate system, with axes $(x, y)$; the ridge local coordinate system, with axes $(l, t)$; and the mask coordinate system, with axes $(u, v)$. Besides these, there is one more coordinate system which will show up shortly, that is the relative coordinate system of the ridge, with axes $(x, y)$. These four systems are essential not only for clarification, but also for reducing the computational complexity.

3.2  Estimating  the  Parameters 

Equation (3) is too complicated to work with directly and needs to be simplified. There are two directions of simplification. First, keep $\theta$ as small as possible, such that $\cos\theta$ can be approximated as 1 and $\sin\theta$ can be approximated as $\theta$. Second, keep $|l_m|$ and $|t_m|$ less than 1. As mentioned in Section 2.1, $|h|$ is typically larger than 2 and $|s_l|$ is much smaller than 0.1, so the absolute ratio of $h$ and $s_l$ is much larger than 20. Hence, when $|l_m|$ and $|t_m|$ are less than 1, the higher-order terms in (3), $b_2\, t_m l_m = (s_l \cos\theta)\, t_m l_m$ and $b_3\, t_m^2$, are negligible compared to $b_1\, t_m = (h \cos\theta)\, t_m$.

Since $\theta$ is the orientation of the mask with respect to the ridge, in order to keep $\theta$ small, the orientation angle of the mask must be chosen close to the angle of the ridge. In the algorithm, there are four possible orientation angles for masks, which are 0, $\pi/4$, $\pi/2$, and $3\pi/4$. The two masks with orientation angles closest to the angle of the ridge will be used, which makes $|\theta| < \pi/4$ and makes the first simplification reasonable.

Figure 2: The transformation from the ridge local coordinates $(l, t)$ to the relative coordinates $(x, y)$.

To keep $l_m$ and $t_m$ small, the relative coordinates $(x, y)$ mentioned in Section 3.1 need to be used here. The origin of the relative coordinates is placed in the center of the grid square through which the ridge passes, then the axes of the relative coordinates are rotated. There are also four possible rotation angles, 0, $\pi/4$, $\pi/2$, and $3\pi/4$. The one closest to the angle of the ridge will be used. Because the masks will be placed on the grid of image coordinates, choosing the masks whose positions are right on the vertices of this square will keep $|l_m|$ and $|t_m|$ less than $\sqrt{2}/2$. Therefore, it makes sense to eliminate the higher-order terms in (3), and to approximate the exponential factor in (3) as 1.

By approximating $\cos\theta$, $\sin\theta$ and the exponential factor, and then eliminating the terms with order higher than 2 in (3), the moment $k$ can be simplified as:

\[ k(l_m, t_m, \theta) \approx c\,(b_0 + b_1 t_m) \tag{4} \]

\[ = c\,(s_l\, \sigma^2\, \theta + h\, t_m) \tag{5} \]

\[ \approx c\, h\, t_m \tag{6} \]

Here $s_l\, \sigma^2\, \theta$ is ignored from (5) to (6) because it is much less than $h\, t_m$. Typically, $|s_l|$ is much smaller than 0.1, $\sigma^2$ is about 2, $|\theta|$ is smaller than $\pi/4$, $|h|$ is much bigger than 2, and $|t_m|$ is smaller than $\sqrt{2}/2$.

The parameters in (6) are evaluated with respect to the ridge local coordinates $(l, t)$. They can be transferred into the relative coordinates $(x, y)$ (see Fig. 2) by using the following transformation.

\[ l_m = \cos(\alpha - \psi)\, x + \sin(\alpha - \psi)\,(y - y_0) \tag{7} \]

\[ t_m = -\sin(\alpha - \psi)\, x + \cos(\alpha - \psi)\,(y - y_0) \tag{8} \]

where $\alpha$ is the angle of the ridge, $\psi$ is the rotation angle of the relative coordinates, and $(0, y_0)$ is the intersection of the ridge and the $y$ axis of the relative coordinates.
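
The transformation in (7) and (8) is a rotation by $\alpha - \psi$ about the point $(0, y_0)$. A minimal Python sketch follows; the function name `relative_to_ridge` and its argument order are assumptions made for this illustration.

```python
import math

def relative_to_ridge(x, y, alpha, psi, y0):
    """Apply (7) and (8): relative coordinates (x, y) -> ridge local (l_m, t_m).

    alpha : angle of the ridge
    psi   : rotation angle of the relative coordinates
    y0    : intersection of the ridge with the y axis of the relative coordinates
    """
    d = alpha - psi
    l_m = math.cos(d) * x + math.sin(d) * (y - y0)
    t_m = -math.sin(d) * x + math.cos(d) * (y - y0)
    return l_m, t_m
```

A point that lies on the ridge itself maps to a transverse coordinate of zero, which is a quick sanity check on the signs.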

Substituting (8) into (6), moment $k$ can be expressed as

\[ k(x, y, \varphi) \tag{9} \]

\[ \approx c\,h\,[-\sin(\alpha - \psi)\, x + \cos(\alpha - \psi)\,(y - y_0)] \tag{10} \]

\[ \approx c\,h\,[-(\alpha - \psi)\, x + (y - y_0)] \tag{11} \]

where $x$, $y$ and $\varphi$ are the position and the orientation of the mask with respect to the relative coordinates. If $\psi$ is chosen to be closest to $\alpha$ among the four rotation angles, the maximum value of $|\alpha - \psi|$ will be $\pi/8$, such that $\sin(\alpha - \psi)$ can be approximated as $(\alpha - \psi)$, and $\cos(\alpha - \psi)$ can be approximated as 1, as shown from (10) to (11).

Figure 3: (a) A ridge; (b) The eight nearby points for the ridge.

Let $h_0$ be equal to $ch$, and $\alpha_0$ be equal to $(\alpha - \psi)$. In (11), the three unknown parameters, $h_0$, $\alpha_0$, and $y_0$, contain information about the amplitude, orientation, and position of a ridge respectively. In order to estimate these parameters, it is necessary to have at least three equations. Based on the criterion used in simplifying (3), given a ridge with angle $\alpha$ and location $(x, y)$, the eight nearby moments should be adopted. Those eight points are shown in Fig. 3, with

\[ x_1 = \lfloor x \rfloor, \quad x_2 = \lceil x \rceil, \quad y_1 = \lfloor y \rfloor, \quad y_2 = \lceil y \rceil, \]

and $\varphi_1$ and $\varphi_2$ the mask orientation angles immediately below and above $\alpha$ among the four available angles.

3.3  Choosing  the  Angles 

As shown in the previous section, there are two kinds of angles that need to be chosen in the algorithm. They are the rotation angle $\psi$ of the relative coordinates, and the orientation angle $\varphi_i$ of the mask. Note that

\[ \psi \in \{0, \tfrac{\pi}{4}, \tfrac{\pi}{2}, \tfrac{3\pi}{4}\} \]

\[ \varphi_1, \varphi_2 \in \{0, \tfrac{\pi}{4}, \tfrac{\pi}{2}, \tfrac{3\pi}{4}\} \]

Depending on the angle $\alpha$ of the targeted ridge, $\psi$ is chosen to be the closest angle to $\alpha$, and $\varphi_1$ and $\varphi_2$ are chosen to be the closest two angles to $\alpha$. Since $\alpha$ is unknown before the detection, $\psi$, $\varphi_1$, and $\varphi_2$ are chosen on a trial-and-error basis by dividing the image plane into eight regions (see Table 1).


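Table 1: The angle of the ridge ($\alpha$), the rotation angle of the relative coordinates ($\psi$), and the orientation angles of the masks ($\varphi_1, \varphi_2$) for regions 1 to 8. Columns: Region, $\alpha$, $\psi$, $\varphi_1, \varphi_2$. (Legible entries: region 1 has $\psi = 0$ and $\varphi_1, \varphi_2 = 0, \pi/4$; region 2 has $\varphi_1, \varphi_2 = 0, \pi/4$; region 8 has $\psi = 0$.)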

In the algorithm, the ridge is first assumed to belong to region 1. In this case, $\psi$ will be chosen as 0; $\varphi_1$ and $\varphi_2$ will be chosen as 0 and $\pi/4$ respectively. The operator discussed in the previous section is then used to determine the three unknown parameters of the ridge. Recall that one of the three unknowns is the orientation. If the ridge with that estimated orientation is in region 1, the hypothesis is correct, and the ridge is detected successfully. Otherwise, the result is discarded, and the next region is tested. For region $j = 1 \ldots 8$, the values of $\psi$, $\varphi_1$, and $\varphi_2$ in Table 1 are used.
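
The hypothesize-and-test loop over the eight regions can be sketched in Python as follows. This is a sketch under a loudly stated assumption: region $j$ is taken to cover ridge angles from $(j-1)\pi/8$ to $j\pi/8$, with $\psi$ the nearest of the four angles and $\varphi_1, \varphi_2$ the two angles bracketing the region. That layout agrees with the values given in the text for region 1 but is otherwise assumed, not taken from Table 1. The names `region_angles`, `detect_with_hypotheses`, and the callback `estimate_alpha` are hypothetical.

```python
import math

ANGLES = [0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4]

def region_angles(j):
    """Return (psi, phi1, phi2) for region j = 1..8 under the assumed layout
    in which region j covers ridge angles [(j - 1) * pi / 8, j * pi / 8)."""
    psi = ANGLES[((2 * j + 1) // 4) % 4]     # nearest of the four angles to the region
    lower = (2 * j - 1) // 4                 # index of the angle just below the region
    return psi, ANGLES[lower % 4], ANGLES[(lower + 1) % 4]

def detect_with_hypotheses(estimate_alpha):
    """Try regions 1..8 in order; accept the first consistent hypothesis.

    estimate_alpha(psi, phi1, phi2) is a caller-supplied (hypothetical) function
    that runs the least-squares operator and returns the estimated ridge angle.
    Returns (region, angle), or None if every hypothesis is rejected.
    """
    for j in range(1, 9):
        psi, phi1, phi2 = region_angles(j)
        alpha = estimate_alpha(psi, phi1, phi2) % math.pi
        if (j - 1) * math.pi / 8 <= alpha < j * math.pi / 8:
            return j, alpha
    return None
```

Trying the regions in a fixed order keeps the operator simple; a result is kept only when the estimated orientation falls back inside the region that was assumed when choosing the masks.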


Hence, there are eight equations with three unknowns:

\[ k_i = h_0\,[-\alpha_0\, x_i + (y_i - y_0)], \qquad i = 1 \ldots 8 \]

To estimate the three unknown parameters, least-squares fitting is used. Let

\[ e = \sum_{i=1}^{8} \left\{ k_i - h_0\,[-\alpha_0\, x_i + (y_i - y_0)] \right\}^2 \tag{12} \]

The set of $(y_0, \alpha_0, h_0)$ which minimizes $e$ will be found (see Section 4); that is, an edgel candidate will be detected.
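
Because the model $k_i = h_0[-\alpha_0 x_i + (y_i - y_0)]$ is linear in $a = -h_0\alpha_0$, $b = h_0$, and $c = -h_0 y_0$, the least-squares fit can be illustrated with an ordinary linear solve followed by a change of variables. The Python sketch below does this with a small hand-written 3x3 solver; the function names `solve3` and `fit_edgel` are assumptions for the example, and the closed-form solutions used in the paper for each region are not reproduced here.

```python
def solve3(A, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for cidx in range(col, n + 1):
                M[r][cidx] -= f * M[col][cidx]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][cidx] * sol[cidx] for cidx in range(r + 1, n))
        sol[r] = s / M[r][r]
    return sol

def fit_edgel(points, moments):
    """Least-squares fit of k_i = h0 * (-a0 * x_i + (y_i - y0)).

    points  : list of (x_i, y_i) in relative coordinates
    moments : list of the moments k_i measured at those points
    Returns (h0, a0, y0).
    """
    rows = [(x, y, 1.0) for x, y in points]
    # Normal equations for the linear form k = a*x + b*y + c.
    A = [[sum(r[p] * r[q] for r in rows) for q in range(3)] for p in range(3)]
    rhs = [sum(r[p] * k for r, k in zip(rows, moments)) for p in range(3)]
    a, b, c = solve3(A, rhs)
    if b == 0:
        raise ValueError("zero amplitude: no edgel to fit")
    return b, -a / b, -c / b
```

With the eight points placed on the vertices of the unit grid square centered at the origin (each vertex used twice, once per mask orientation), the normal equations are diagonal and the fit reduces to sums and differences of the eight moments.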


4  Statistical  Analysis 

Because  noise  in  an  image  causes  spurious  edgels  and 
false  alarms,  it  is  useful  to  understand  how  noise  af¬ 
fects  the  detections  of  edgels.  The  fluctuations  of  the 
edgel  parameters  provide  not  only  information  about 
how  robust  the  algorithm  is  under  noise,  but  also  a 
mathematical  basis  for  the  linking. 

As mentioned in Section 3.3, there are eight regions in total. Depending on which region the edgel belongs to, the least-squares solutions will be slightly different. But since the idea applied to each region is the same, the analyses for them are also similar. In this section, an edgel in region 1 will be discussed thoroughly. For edgels in the other regions, the results of theoretical analyses will be given in Appendix B.

From (12), the least-squares solutions for an edgel in region 1 are:

\[ \hat{h}_0 = \frac{k_1 + k_2 + k_5 + k_6 - k_3 - k_4 - k_7 - k_8}{4} \]

\[ \hat{\alpha}_0 = \frac{k_1 + k_3 + k_5 + k_7 - k_2 - k_4 - k_6 - k_8}{k_1 + k_2 + k_5 + k_6 - k_3 - k_4 - k_7 - k_8} \]

\[ \hat{y}_0 = -\frac{1}{2}\, \frac{k_1 + k_2 + k_3 + k_4 + k_5 + k_6 + k_7 + k_8}{k_1 + k_2 + k_5 + k_6 - k_3 - k_4 - k_7 - k_8} \]

where $k_i$ is the convolution of the noisy image with the mask $M_i$.

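Let $\Delta\alpha_0$ be the fluctuation of the orientation,

\[ \Delta\alpha_0 = \hat{\alpha}_0 - \alpha_0 = \sum_{i=1}^{8} \frac{\partial \alpha_0}{\partial k_i}\, \delta k_i + \text{H.O.T.} = \delta\alpha_0 + \text{H.O.T.} \]

where $\delta\alpha_0$ is the linear approximation of $\Delta\alpha_0$.

Then, the variance of $\hat{\alpha}_0$ can be expressed as

\[ \sigma_{\hat{\alpha}_0}^2 \approx Var(\delta\alpha_0) = E\!\left[ \left( \sum_{i=1}^{8} \frac{\partial \alpha_0}{\partial k_i}\, \delta k_i \right)^{2} \right] = \sum_{i=1}^{8} \sum_{j=1}^{8} \frac{\partial \alpha_0}{\partial k_i}\, \frac{\partial \alpha_0}{\partial k_j}\, E(\delta k_i \times \delta k_j) \tag{15} \]

In order to compute the variance of $\delta\alpha_0$, it is necessary to first calculate the variance of $\delta k_i$, and the covariance of $\delta k_i$ and $\delta k_j$.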


Recall that the moment $k$ is the convolution of the noisy image with the mask $M$, and that the noise is modeled as additive, zero-mean, white Gaussian noise with standard deviation $\sigma_N$. Since convolution is a linear operation, it can be deduced that the fluctuation of the moment, $\delta k$, will be a Gaussian random variable with mean and variance expressed as

\[ E(\delta k_i) = 0 \]

\[ Var(\delta k_i) = E(\delta k_i \times \delta k_i) = \sigma_N^2 \iint M_i(u, v)^2\, du\, dv = \frac{\sigma_N^2}{8\pi\sigma_u\sigma_v^3} \]

Similarly, the covariance of $\delta k_i$ and $\delta k_j$ can be derived from the integral of the product of the corresponding masks $M_i$ and $M_j$, for $i = 1 \ldots 8$ and $j = 1 \ldots 8$. All of them will be listed in Appendix A.

Let $\sigma_u$ and $\sigma_v$ equal 1, and substitute all the variances and covariances into (15). The approximate fluctuation of the orientation, $\delta\alpha_0$, will then be a Gaussian random variable with mean and variance expressed as

\[ E(\delta\alpha_0) = 0 \tag{16} \]

\[ Var(\delta\alpha_0) \approx (0.12 + 0.10\,\alpha_0 + 0.42\,\alpha_0^2)\, \frac{\sigma_N^2}{h^2} \tag{17} \]

Using the same method, the approximate fluctuation of the position, $\delta y_0$, will be a Gaussian random variable with mean and variance expressed as

\[ E(\delta y_0) = 0 \tag{18} \]

\[ Var(\delta y_0) \approx (0.2418 + 0.4249\, y_0^2)\, \frac{\sigma_N^2}{h^2} \tag{19} \]

Conventionally, the ratio of $h$ and $\sigma_N$ is called the signal to noise ratio, or the normalized contrast. From (17) and (19), it can be seen that edges with larger contrast have smaller fluctuations, and edges with smaller contrast have larger fluctuations.

These models of fluctuations have been verified by experiments. In the first step of the simulation, synthesized images with edges in known positions and orientations were generated. Then the edge detector was applied to them to determine the edgel parameters. Finally, the standard deviations of $\hat{\alpha}_0$ and $\hat{y}_0$ were calculated.

All the simulation data and theoretical results are plotted in Fig. 4 and Fig. 5: the relation between $\sigma_{\hat{\alpha}_0}$ (in radians) and contrast is plotted in Fig. 4; the relation between $\sigma_{\hat{y}_0}$ (in pixels) and contrast is plotted in Fig. 5. In these figures, circles represent the sample variances of the simulation data, and the line represents the expected values of $\sigma_{\hat{\alpha}_0}^2$ and $\sigma_{\hat{y}_0}^2$ over $\alpha_0$ and $y_0$ respectively, i.e.

\[ E_{\alpha_0}(\sigma_{\hat{\alpha}_0}^2) \approx E_{\alpha_0}[Var(\delta\alpha_0)] = 0.1449\, \frac{\sigma_N^2}{h^2} \]

\[ E_{y_0}(\sigma_{\hat{y}_0}^2) \approx E_{y_0}[Var(\delta y_0)] = 0.2772\, \frac{\sigma_N^2}{h^2} \]

Here, $\alpha_0$ is assumed to be uniformly distributed between 0 and $\pi/8$, and $y_0$ is assumed to be uniformly distributed between $-0.5$ and $0.5$.
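
The expected position variance can be checked by hand from (19): averaging $0.2418 + 0.4249\,y_0^2$ over $y_0$ uniform on $[-0.5, 0.5]$ uses $E[y_0^2] = 1/12$. The short Python sketch below performs that arithmetic; the function names are assumptions for the example, and only the position formula is treated.

```python
import math

def mean_position_variance_coeff(a=0.2418, b=0.4249, half_width=0.5):
    """Average of a + b * y0**2 for y0 uniform on [-half_width, half_width].

    For a uniform variable on [-w, w], the mean of y0**2 is w**2 / 3.
    """
    return a + b * half_width ** 2 / 3.0

def position_std(contrast):
    """Expected standard deviation of y0 (in pixels) at contrast h / sigma_N."""
    return math.sqrt(mean_position_variance_coeff()) / contrast
```

The coefficient evaluates to about 0.2772, in agreement with the expected value quoted for the position variance, and `position_std(4)` is about 0.13 pixels.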

The simulation data match the theoretical result. Furthermore, this algorithm is robust: it can detect the edges accurately even under large noise. For example, when the contrast equals 4, the expected variances above give a standard deviation of about 0.095 radians ($\sqrt{0.1449}/4$) for $\hat{\alpha}_0$, and a standard deviation of about 0.13 pixels ($\sqrt{0.2772}/4$) for $\hat{y}_0$.


\[ E(\delta k_i \times \delta k_j) = \sigma_N^2 \iint M_i(u, v)\, M_j(u, v)\, du\, dv \]


The simulation data (cross marks) and theoretical result (dashed line) of the Wang-Binford operator are also plotted in the figures as baselines for comparison. The standard deviations of both orientation and position have been improved by a factor of four in the edge detector described here.

Figure 4: $\sigma_{\hat{\alpha}_0}$ (in radians) versus contrast. Marks: simulation data; lines: theoretical result. The performance of the Wang-Binford operator is included as a comparison.

Figure 5: $\sigma_{\hat{y}_0}$ (in pixels) versus contrast. Marks: simulation data; lines: theoretical result. The performance of the Wang-Binford operator is included as a comparison.

5  Edge  Images 
5.1  Linking 

The detected edgels in an image contain little information about the scene unless they are linked into extended edges. Previous work did little about linking due to the lack of accurate edgel data [8] [9]. In this algorithm, since the edgel estimates are accurate, the linking process becomes much easier.

Assume there are two nearby edgels with parameters $(x_1, y_1, \alpha_1)$ and $(x_2, y_2, \alpha_2)$, respectively. The hypothesis used here is that the two edgels belong to the same extended edge. If it is true, the intensities of the two edgels should have the same sign and the difference of their orientations should be small. Statistically, the hypothesis may be rejected when it is true; the probability of this is called the false negative rate. The false negative rate depends on the threshold [10]. As discussed in the previous section, the fluctuation of the edgel orientation is approximated by a Gaussian random variable, thus the threshold on $\alpha$ can be chosen to give a constant false negative rate. In order to have the rate equal to 0.01, the threshold is set to be $2.8\,\sigma_\alpha$, i.e. the acceptance region of the hypothesis test is

\[ |\alpha_1 - \alpha_2| < 2.8\, \sigma_\alpha \]


From Section 4 and Appendix B, the fluctuations of the edgel orientation in different regions are derived. By averaging all of them, the variance of $\alpha$ can be written as

\[ E(\sigma_\alpha^2) \approx 0.2554\, \frac{\sigma_N^2}{h^2} \]

Combining these two equations, the criterion used to group two edgels is

\[ |\alpha_1 - \alpha_2| < 2.8\, \sigma_\alpha = 2.8 \times \left( 0.5053\, \frac{\sigma_N}{h} \right) \approx \frac{\pi}{2}\, \frac{\sigma_N}{h} \]

That is, two nearby edgels will be linked together if the difference of their orientations is smaller than this threshold.
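
The linking test can be written as a small predicate. The Python sketch below uses the threshold $2.8 \times 0.5053\,\sigma_N / h$ from the text; the function name `can_link`, the choice of the smaller amplitude magnitude as $h$, and the treatment of orientations modulo $\pi$ are assumptions made for this illustration, since the text does not specify them.

```python
import math

def can_link(alpha1, h1, alpha2, h2, sigma_n):
    """Decide whether two nearby edgels may belong to the same extended edge.

    alpha1, alpha2 : estimated orientations in radians
    h1, h2         : estimated amplitudes (must share a sign to link)
    sigma_n        : standard deviation of the image noise
    """
    if h1 * h2 <= 0:                       # intensities must have the same sign
        return False
    d = abs(alpha1 - alpha2) % math.pi     # assumed: orientations are modulo pi
    d = min(d, math.pi - d)
    h = min(abs(h1), abs(h2))              # assumed: use the weaker amplitude
    return d < 2.8 * 0.5053 * sigma_n / h
```

Because the threshold scales with $\sigma_N / h$, high-contrast edgels must agree closely in orientation, while low-contrast edgels are allowed a wider tolerance.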

The searching region is defined in Fig. 6. The pixel in which the currently detected edgel sits is marked by a dot. The other 12 pixels in the searching region are marked by numbers. The numbers indicate the testing order, which is based on the distance between the testing pixel and the current pixel: the shorter the distance, the higher the priority of the testing pixel. In other words, the pixel with a lower number will be tested first. Note that it is not necessary to test the pixels to the right of or below the current pixel, because those tests will be done when the operator moves to those pixels.

Figure 6: The searching region for linking. The dot marks the pixel in which the currently detected edgel sits. The numbers mark the pixels which will be tested.

Though only delta edges have been discussed thoroughly in this paper, the other two kinds of edges can be detected successfully by the same family of operators, as shown in Section 5.2. The results indicate that the extended edges are very accurate and informative.

The detector is fast: it takes 20 seconds to process a 256x256 image on an SGI Indy R4400SC workstation.


Appendix A


The  covariances  for  the  adjacent  eight  moments  ki, 
i  =  1 . .  .8  (see  Fig.  3(b)). 


To sum up, the operator scans the whole image once, from left to right, then from top to bottom. If it detects an edgel, it will test the pixels in the searching region. As soon as the edgel in a testing pixel meets the linking criterion, it will be grouped with the detected edgel, and the tests will stop. If all 12 tests fail, the detected edgel will become the beginning of a new edge.
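
The single-pass linking scan can be sketched in Python as follows. The sketch is hedged in two ways: the list `HYPOTHETICAL_OFFSETS` is an assumed set of 12 already-visited neighbours ordered by distance, since the actual searching region is only given in Fig. 6, and the linking criterion is passed in as a predicate rather than fixed. The names `link_edgels` and `HYPOTHETICAL_OFFSETS` are introduced only for this example.

```python
# Assumed search region: 12 previously scanned neighbours, nearest first.
HYPOTHETICAL_OFFSETS = [
    (0, -1), (-1, 0), (-1, -1), (-1, 1),
    (0, -2), (-2, 0),
    (-1, -2), (-1, 2), (-2, -1), (-2, 1),
    (-2, -2), (-2, 2),
]

def link_edgels(edgels, can_link, offsets=HYPOTHETICAL_OFFSETS):
    """Group edgels into extended edges in one raster scan.

    edgels   : dict mapping (row, col) -> edgel record
    can_link : predicate taking two edgel records
    offsets  : search-region offsets (d_row, d_col), nearest first
    Returns a dict mapping (row, col) -> integer edge label.
    """
    labels = {}
    next_label = 0
    for pos in sorted(edgels):               # rows top to bottom, columns left to right
        row, col = pos
        for d_row, d_col in offsets:
            other = (row + d_row, col + d_col)
            if other in labels and can_link(edgels[pos], edgels[other]):
                labels[pos] = labels[other]  # join the first neighbour that passes
                break
        else:
            labels[pos] = next_label         # every test failed: start a new edge
            next_label += 1
    return labels
```

Stopping at the first neighbour that passes mirrors the description above; it keeps the scan cheap but does not merge two edges that later turn out to meet.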

5.2  Results 

The detected edge images are superior even though the linking criterion used here is straightforward. Examples are given in Fig. 7, Fig. 8, Fig. 9, and Fig. 10. Fig. 7 shows the detected delta edge image of a natural scene; Fig. 8 shows the detected step edge image of roads; Fig. 9 shows the detected step edge image of buildings; and Fig. 10 shows the detected crease edge image of a thread. The results indicate that the operator detects all three types of edges successfully and extracts the orientations and positions of edges correctly. Moreover, it takes only 20 seconds to process a 256x256 image on an SGI Indy R4400SC workstation.


Let  ^ , 


E(^6ki  X  6k2)  —  E{8k^  x  Sk^) 
E{Sk5  X  Ske)  =  E{6k7  x  Skg) 
■>  1  .  1  . 


) 


E{6ki  X  Sks)  =  E{6k2  x  bk^) 
E(6k^  X  Skj)  =  E{6ke  x  Sk^) 

8^  exp(-j^)(l  - 


E{6ki  X  6ki)  =  E{6k2  x  6kz) 
E{dkz  X  bks)  =  E{dkQ  x  6k7) 


CTn 


E{dki  X  dkz)  =  E{bk2  x  bk^) 
E{bk3  X  bk7)  =  E{bk^  x  bks) 
^  .2  1 


6  Conclusion 

An  edge  detector  for  delta  edges,  step  edges,  and 
crease  edges  has  been  developed.  It  is  insensitive  to 
shading  and  noise,  and  it  is  fast. 


E{bki  X  bk^)  —  E{bk2  x  Sfcs) 
E{bk3  X  bks)  =  E{bki  x  bk7) 


1 

87r(7-‘* 


Based on the physics of image formation, an edgel is modeled by three parameters: the position, orientation, and amplitude. To extract information from the input image, the image is first convolved with a mask. By sampling the convolved image at each pixel at four orientations, the 2x2x2 cubes are built. The three edgel parameters are estimated over the cube; that is, the least-squares estimate of an edgel is obtained from the eight simplified equations under the relative coordinate system.

Theoretical  analyses  have  been  verified  by  simulation. 
It  is  shown  that  the  algorithms  can  detect  edges  well 
even  when  the  signal  to  noise  ratio  is  less  than  4. 


E{bki  X  bkT)  =  E{bk2  x  bks) 
E{bk3  X  bk^)  =  E{bk4  x  bke) 


V2 


2  1  /  1  N/1  1a 


E{bki  X  bks) 
1  ,  1 


V2 


'N 


87rcr^ 


E[bki  X  bke) 


V2  ^  87r<74 


=  E{bk4  X  bk^) 

exp(- j^)(l  -  i) 

=  E{bk2  X  bk7) 
Appendix B

The  expected  values  of  the  theoretical  analyses  for 
edgels  in  region  i,  i  =  1 . .  .8  (see  Table  1). 


=  cr„  =  1, 
Figure 7: (a) The original image (256 x 256); (b) The detected delta edge image.

Figure 8: (a) The original image (256 x 256); (b) The detected step edge image.

Figure 9: (a) The original image (256 x 256); (b) The detected step edge image.

Figure 10: (a) The original image (135 x 136); (b) The detected crease edge image.
Abstract

This  paper  presents  a  methodology  for  automat¬ 
ically  extracting  a  set  of  positions  from  the  im¬ 
age  of  a  target  to  assemble  a  template,  which 
can  enable  high  accuracy  detection  and  local¬ 
ization  of  the  target  in  imagery  using  template 
matching  based  object  recognition  algorithms. 

The selection criterion is based on determining the covariance of the location produced by the template matching algorithm. The methodology minimizes this covariance by selecting the right set of positions. Using only subsets of positions in the object can increase the invariance of the template, and has the potential of improving the robustness of the template matching algorithm to non-additive, non-Gaussian noise. This methodology can also be generalized to apply to many distance based recognition algorithms.

1  Introduction 

The difficulty with the classical use of templates for target recognition is twofold: one part has to do with the spatial perturbation of the target in the image, and the other has to do with the fact that the gray scale perturbing noise is non-additive and non-Gaussian. Therefore, when template matching or matched filtering is done, the quantity computed is not related to the log probability of the data given the target. In this paper, we try to deal with the second effect, the effect of non-additive non-Gaussian perturbation.


When  designing  a  particular  template  matching 
algorithm,  there  are  two  things  to  be  done  first: 
obtaining  the  template  and  devising  the  similar¬ 
ity  measure.  Conventional  template  matching 
algorithm  development  is  like  an  “open-loop” 
system,  where  identifying  the  template  is  always 
the  first  step  and  after  that  the  obtained  tem¬ 
plate  is  held  optimal  and  no  update  is  intended 
for  it.  Our  methodology  “closes  the  loop”  in  the 
algorithm  development  stage  by  examining  the 
location  produced  by  a  choice  of  the  template 
(and  its  corresponding  similarity  measure),  and 
using  the  obtained  information  to  modify  the 
template  so  that  the  new  template  will  pro¬ 
duce  a  better  estimate  of  the  location.  This  is 
achieved  by  analytically  propagating  the  pertur¬ 
bation  on  the  image  data  through  to  the  covari¬ 
ance  of  the  location  estimated  for  the  position  of 
the  target.  In  the  particular  context  of  target 
detection  and  recognition,  we  start  with  some 
small  initial  template  and  find  the  spatial  di¬ 
rection  in  which  the  estimated  location  has  the 
highest  variance.  Then  we  examine  other  parts 
of  the  target  area  and  add  into  the  template 
those  areas  such  that  the  resulting  template  will 
give  a  location  estimate  whose  variance  in  that 
direction  is  the  smallest. 
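
The template-growing idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats the 1-D case, takes the predicted location variance to be proportional to the reciprocal of the summed squared slopes of the selected points, and estimates slopes by finite differences instead of a cubic spline. The function names and the stopping threshold `target_var` are hypothetical.

```python
def slopes(values, spacing=1.0):
    """Finite-difference slope at each sample (one-sided at the two ends)."""
    n = len(values)
    out = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        if hi == lo:  # a single-sample signal has no slope information
            out.append(0.0)
        else:
            out.append((values[hi] - values[lo]) / ((hi - lo) * spacing))
    return out


def predicted_variance(selected, slope, noise_var=1.0):
    """Assumed proxy: noise variance divided by the summed squared slopes."""
    energy = sum(slope[i] ** 2 for i in selected)
    return float("inf") if energy == 0.0 else noise_var / energy


def grow_template(values, initial, target_var, noise_var=1.0, spacing=1.0):
    """Greedily add the point that lowers the predicted variance the most."""
    slope = slopes(values, spacing)
    selected = list(initial)
    remaining = [i for i in range(len(values)) if i not in selected]
    while remaining and predicted_variance(selected, slope, noise_var) > target_var:
        best = min(
            remaining,
            key=lambda i: predicted_variance(selected + [i], slope, noise_var),
        )
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)
```

On a flat signal with a single spike, the sketch picks the two samples on the flanks of the spike, where the slope is largest, and leaves the flat samples out.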

Our discussion addresses the simplest example
where the templates are sets of pixel positions
with their gray level values, and the similarity
measure (actually a difference measure) is
the mean squared error. We also choose the
1-D model to simplify the notation. However,
the templates can be composed of other more
general feature values such as the coefficients
of  some  orthogonal  transforms  of  the  image  of 
the  object.  Therefore,  this  methodology  is  not 
restricted  to  the  example  presented  here.  Since 
many  of  the  existing  matching  based  algorithms 
follow  this  formulation  and  their  similarity  mea¬ 
sures  are  usually  fairly  standard,  the  methodol¬ 
ogy  we  present  can  be  applied  to  a  variety  of 
algorithms  to  improve  their  performance. 

2  Subset  versus  entire  object 

Template matching using plain pixel values is
mostly used in cases where the objects of
interest do not have many outstanding features,
or where such features cannot be detected reliably.
Conventional template matching methods
use  all  positions  and  their  pixel  values  inside  the 
object  of  interest.  The  idea  behind  this  practice 
is  to  use  as  much  information  as  possible  about 
the  object.  Most  often,  pixels  in  the  template 
are  treated  equal  in  their  contribution  in  the 
similarity  measure. 

However, target areas which are spatially
non-distinct and have low contrast not only fail
to contribute, but can in effect degrade
detection accuracy and localization. The
areas  of  the  target  that  are  least  affected  by 
non-additive  non-Gaussian  noise  are  those  tar¬ 
get  areas  in  which  there  is  a  relatively  high 
contrast spatial structure. The high contrast
will be the dominant effect relative to the
non-additive non-Gaussian noise perturbation and
thereby  permit  good  detection  even  with  a  tech¬ 
nique  whose  basis  assumes  an  additive  Gaussian 
perturbation.  Assembling  these  spatial  struc¬ 
tures  together  as  the  template  will  permit  good 
localization  of  the  position  of  the  detected  tar¬ 
get. 

In  the  next  section,  we  introduce  an  idea  of 
measuring  the  discriminatory  power  of  a  tem¬ 
plate.  Although  the  discriminatory  power  of  the 
template  is  the  biggest  when  all  points  are  in¬ 
cluded,  the  major  share  of  the  discriminatory 
power  is  often  contributed  by  a  small  subset  of 
the  points,  and  the  contribution  from  the  rest 
of  the  points  is  marginal.  Some  of  the  points 
from which most of the discriminatory power
comes will be chosen to form the so-called most
discriminatory  set.  When  this  set  is  used  to  rep¬ 
resent  the  original  entire  object,  the  deviation  of 
the  assumed  noise  model  from  the  true  distribu¬ 
tion  of  the  noise  will  have  the  least  effect  on  the 
process  of  locating  the  object.  Thus,  although 
a small amount of the discriminatory power is
lost  by  not  using  all  the  points,  this  disadvan¬ 
tage  is  overcome  by  the  increased  robustness  to 
noise  for  which  the  assumed  model  is  wrong. 

3  Finding  the  best  subset 

3.1  A  measure  of  discriminatory 
power  of  templates 

When  template  matching  is  used  to  detect  and 
locate  a  certain  object  in  a  noisy  observation, 
the  result  produced  is  usually  perturbed  from 
the  true  location.  This  perturbation  is  called 
the  output  perturbation,  and  is  a  function  of 
both  the  template  and  the  perturbation  present 
in  the  observation  which  is  called  the  input 
perturbation. This output perturbation constitutes
the error of the output from the template
matching algorithm, and is a random variable
usually with mean zero. The covariance propagation
technique [Haralick, 1996] can be used
to express the covariance of the error as a function
of the covariance of the input perturbation.
This function is a good measure of how sensitive
the location of this template is to the input
perturbation. When the variance of the input
perturbation is kept at a fixed level, this sensitivity
is high if the variance of the error is large,
and hence we say the discriminatory power of
the template is low. If the variance is small, the
sensitivity is low, so the template is relatively
easier to detect and locate in the perturbed
observation, and it therefore has more discriminatory
power.

Consider the example of a 1-D template in Figure 1.
We want to use the MSE as the similarity
measure to detect this pattern in some
perturbed input. Intuitively, some points in the
neighborhood of low contrast could be taken
away from the template without significantly
lowering the performance of matching. However,
if one or two of the large values were taken
away, the uncertainty of the result should
increase significantly.

Figure 1: An example of a discrete 1-D template.
Dotted line is the cubic spline
interpolating function.

We  did  an  experiment  to  see  if  this  intuition  is 
correct. We generated a 100-point iid N(0, 1)
noise.  Then  we  multiplied  the  template  in  Fig¬ 
ure  1  by  2.3  and  added  it  to  the  noise,  with  its 
first  point  aligned  with  point  #43  of  the  noise. 
This  is  one  realization  of  the  perturbed  obser¬ 
vation.  Then  we  used  exhaustive  search  to  find 
the estimate of the template's location that
minimizes the MSE. This procedure was repeated
100,000 times and the sample mean and
variance of the difference between the estimate and
the  true  location  were  obtained.  The  sample 
variance  in  this  case  was  1.28  x  10“^.  Then 
we  did  two  more  experiments  with  the  point 
#3  and  #10  taken  out  from  the  original  tem¬ 
plate  respectively.  The  sample  variances  were 
7.09  X  10“^  and  8.15  x  10~^,  respectively.  So 
taking out point #3 did not matter too much,
but taking out point #10 caused a big drop in
the  performance.  This  result  confirmed  our  in¬ 
tuition  that  points  in  a  template  differ  in  their 
contribution  to  the  discriminatory  power  of  the 
template. 
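
The experiment above can be approximated with the following Python sketch. The values of the Figure 1 template are not given in the text, so the caller supplies a hypothetical `signal`; it is also an assumption here that the matching template uses the same scaled values that were added to the noise. The helper names `locate` and `location_variance` are illustrative only.

```python
import random


def locate(observation, template):
    """Exhaustive search for the offset that minimizes the squared error.

    `template` is a list of (position, value) pairs, with positions given as
    integer offsets from the reference point, so subsets are easy to express.
    """
    span = max(p for p, _ in template)
    best_offset, best_err = 0, float("inf")
    for offset in range(len(observation) - span):
        err = sum((observation[offset + p] - v) ** 2 for p, v in template)
        if err < best_err:
            best_offset, best_err = offset, err
    return best_offset


def location_variance(signal, keep, true_offset=42, length=100,
                      noise_sd=1.0, trials=200, seed=0):
    """Sample variance of the location error over `trials` noisy observations.

    `signal` is the full (already scaled) pattern placed at `true_offset`;
    `keep` lists the indices of `signal` used as the matching template.
    Requires trials >= 2.
    """
    rng = random.Random(seed)
    template = [(i, signal[i]) for i in keep]
    errors = []
    for _ in range(trials):
        obs = [rng.gauss(0.0, noise_sd) for _ in range(length)]
        for i, v in enumerate(signal):
            obs[true_offset + i] += v
        errors.append(locate(obs, template) - true_offset)
    mean = sum(errors) / trials
    return sum((e - mean) ** 2 for e in errors) / (trials - 1)
```

Calling `location_variance` once with all indices and once with a single index removed gives a rough, sample-based comparison of how much each point contributes.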

In  the  next  sub-sections,  we  will  use  covariance 
propagation  to  get  a  theoretical  prediction  of 
the  output  variance  as  a  function  of  the  input 
variance. 


3.2  Notation 

Since  we  will  be  applying  the  technique  to  tar¬ 
get  recognition  in  digital  images,  we  discuss  the 
problem  in  the  discrete  case. 

A template consists of a set of positions, each
of which is associated with the signal value
of the template at that position. Let
$X = \{x_1, x_2, \ldots, x_N\}$ ($x_1 < x_2 < \ldots < x_N$) be a
set of positions. The position of the n-th point
of the template relative to the chosen reference
point is $x_n$. The signal value of the point is
$h(x_n)$. Let $H = (h(x_1), h(x_2), \ldots, h(x_N))^T$ be
an $N \times 1$ vector denoting the template.

Suppose our observation is made on a finite
set of $M$ discrete positions. This set is denoted
as $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_M\}$. For each position
$\gamma_m$ we observe a signal value $s(\gamma_m)$, where
$s(\cdot)$ is a real function defined on $\Gamma$. Let
$S = (s(\gamma_1), s(\gamma_2), \ldots, s(\gamma_M))^T$ be an $M \times 1$ vector
for the observed values, which we call the
observable.

Since $X$ and $\Gamma$ are both discrete point sets,
there is only a finite set of locations where the
template can possibly appear in the observable.
(This restriction could be relaxed if we estimated
the location of the template to subpixel
accuracy, but we choose not to do that and
require $X$ and $\Gamma$ to be on the same sampling grid.)
We use $\theta$ to denote the location of the reference
point of the template in the observable; then the
set of possible $\theta$ is

$$\Theta = \{\theta \mid X_\theta \subseteq \Gamma\} \qquad (1)$$

where $X_\theta = \{x + \theta \mid x \in X\}$.

In usual digital signal applications, $\Gamma$ is given by
the sampling process as an evenly spaced location
set. $X$ is a chosen set whose spacing might
be uneven, but the spacing between each successive
pair is always some multiple of the spacing
in $\Gamma$. When these two are given, $\Theta$ is uniquely
determined by Equation 1.
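
As a small illustration of Equation 1, the sketch below enumerates the admissible locations for a template position set and an observation grid. It assumes positions are integer grid indices, so that membership tests are exact; the function name is hypothetical.

```python
def feasible_locations(template_positions, grid_positions):
    """Return every shift theta for which the shifted template lies on the grid."""
    grid = set(grid_positions)
    anchor = min(template_positions)
    # A shift can only be feasible if it maps the first template point onto the grid.
    candidates = sorted(g - anchor for g in grid)
    return [
        theta for theta in candidates
        if all(x + theta in grid for x in template_positions)
    ]
```

For a template with positions 0, 1 and 3 on a six-point grid 0 to 5, the admissible locations are 0, 1 and 2.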

Suppose the template is present in the observable
at location $\theta_0$ (that is, with its reference
point located at $\theta_0$), where $\theta_0$ is a fixed constant
from $\Theta$. In the case of ideal observation
where there is no perturbation, we use $s^*(x)$ to
denote the value of the observable at position
$x \in \Gamma$. The template is just a part of the ideal
observable, and

$$h(x) = s^*(x + \theta_0) \quad \text{for } x \in X \qquad (2)$$

With additive iid Gaussian perturbation, we
will observe

$$s(x) = s^*(x) + \nu(x), \quad x \in \Gamma \qquad (3)$$

where the perturbation $\nu(\cdot)$ is defined on $\Gamma$, and
$\nu = (\nu(\gamma_1), \nu(\gamma_2), \ldots, \nu(\gamma_M))^T \sim N(0, \sigma^2 I_M)$,
where $I_M$ is the $M \times M$ identity matrix, and $\sigma^2$,
a positive scalar, is the variance of each element
of $\nu$.

We use $S^* = (s^*(\gamma_1), s^*(\gamma_2), \ldots, s^*(\gamma_M))^T$ and
$S = (s(\gamma_1), s(\gamma_2), \ldots, s(\gamma_M))^T$ to denote the
observations made in these two cases, and they are
related by $S = S^* + \nu$.

3.3 Criterion function

The MSE criterion function is used, and is to be
minimized with respect to the location parameter
$\theta$:

$$F(S, \theta) = \sum_{n=1}^{N} [s(x_n + \theta) - h(x_n)]^2 \quad (\theta \in \Theta) \qquad (4)$$

In the ideal no-perturbation case, $S = S^*$ and
$F(S^*, \theta_0) = 0$. Therefore the true location $\theta_0$
is the solution to the problem in the ideal case.
When there is perturbation on the observable,
we get $S = \hat{S}$ and the estimated location is

$$\hat{\theta} = \arg\min_{\theta \in \Theta} \left\{ \sum_{n=1}^{N} [\hat{s}(x_n + \theta) - h(x_n)]^2 \right\} \qquad (5)$$

There are various schemes for the minimization.
Our requirement is that the solution $\hat{\theta}$ from the
chosen technique is a local minimum of $F(S, \theta)$,

$$\frac{\partial F(S, \theta)}{\partial \theta} = 0 \qquad (6)$$

at $(\hat{S}, \hat{\theta})$. Equation 6 also holds for $(S^*, \theta_0)$ since
it is a global minimum of Equation 4.

3.4 Propagating the covariance

Suppose that given observable $S = \hat{S}$, from
some minimization scheme we have obtained a
$\hat{\theta}$ that satisfies Equations 5 and 6. Due to the
perturbation on $S$, $\hat{\theta}$ is also perturbed from $\theta_0$.
Assuming this perturbation is small and also
additive, we obtain $\hat{\theta} = \theta_0 + \Delta\theta$ where $\Delta\theta$ is
the small perturbation.

Consider, for $\theta \in \Theta$,

$$g(S, \theta) = \frac{\partial F(S, \theta)}{\partial \theta} = 2 \sum_{n=1}^{N} [s(x_n + \theta) - h(x_n)] \frac{\partial s(x_n + \theta)}{\partial \theta} \qquad (7)$$

In particular, we want to study the above
function at two particular points: $(S^*, \theta_0)$ and
$(\hat{S}, \hat{\theta})$, namely the ideal noise-free observable
with the true location of the template, and the
perturbed observable with the estimated location
from our selected minimization scheme.

Assuming that the perturbations $\nu$ and $\Delta\theta$ are
small, we approximate $g(\hat{S}, \hat{\theta})$ by the linear
terms of its Taylor series expansion at $(S^*, \theta_0)$.
So, approximately, we write

$$g(\hat{S}, \hat{\theta}) = g(S^*, \theta_0) + \left(\frac{\partial g(S^*, \theta_0)}{\partial S}\right)^T \nu + \left(\frac{\partial g(S^*, \theta_0)}{\partial \theta}\right)^T \Delta\theta \qquad (8)$$

where we have used the notation

$$\frac{\partial g(S^*, \theta_0)}{\partial S} = \left.\frac{\partial g(S, \theta)}{\partial S}\right|_{S = S^*,\ \theta = \theta_0}, \qquad \frac{\partial g(S^*, \theta_0)}{\partial \theta} = \left.\frac{\partial g(S, \theta)}{\partial \theta}\right|_{S = S^*,\ \theta = \theta_0}$$

Since both $(S^*, \theta_0)$ and $(\hat{S}, \hat{\theta})$ are local minima
of $F(S, \theta)$ in Equation 4, we have

$$g(S^*, \theta_0) = 0 \quad \text{and} \quad g(\hat{S}, \hat{\theta}) = 0$$

Let

$$A = \frac{\partial g(S^*, \theta_0)}{\partial \theta} \quad \text{and} \quad B = \frac{\partial g(S^*, \theta_0)}{\partial S}$$

and assuming that $A^{-1}$ exists, we obtain

$$\Delta\theta = -(A^{-1})^T B^T \nu$$

Recall that $\nu \sim N(0, \sigma^2 I_M)$; we have

$$E(\Delta\theta) = 0$$

$$\mathrm{Cov}(\Delta\theta) = \sigma^2 (A^{-1})^T \cdot B^T \cdot B \cdot A^{-1}$$

Since $\theta$ only takes values from the discrete set
$\Theta$, the derivative $\partial s(x_n + \theta)/\partial \theta$ in Equation 7
is undefined. We have to define it properly before
we can get an expression of $\mathrm{Cov}(\Delta\theta)$ in
terms of $H$. Here we use cubic spline interpolation
[Press et al., 1992] to obtain a continuous
interpolating function $\tilde{s}(\cdot)$ from $S$. The
derivative $d\tilde{s}(x)/dx$ is used in place of $ds(x)/dx$.
After some manipulations [Liu and Haralick, 1997],
the final result is

$$\mathrm{Cov}(\Delta\theta) = \sigma^2 \cdot \left\{ \sum_{n=1}^{N-1} \left[ \frac{h(x_{n+1}) - h(x_n)}{x_{n+1} - x_n} - \frac{x_{n+1} - x_n}{6}\bigl(2h''(x_n) + h''(x_{n+1})\bigr) \right]^2 + \left[ \frac{h(x_N) - h(x_{N-1})}{x_N - x_{N-1}} + \frac{x_N - x_{N-1}}{6}\bigl(2h''(x_N) + h''(x_{N-1})\bigr) \right]^2 \right\}^{-1} \qquad (10)$$

where $h''(x) = s^{*\prime\prime}(x + \theta_0)$ for $x \in X$ and can
be expressed in terms of $S^*$.

3.5 Choosing subset according to
propagated covariance

Continuing the discussion in Section 3.1, Equation 10
can be used to measure the discriminatory
power of the template $H$. When terms
are taken out from the sum in Equation 10, the
value of $\mathrm{Cov}(\Delta\theta)$ will increase, indicating less
discriminatory power.

Suppose we have already chosen an initial template
with a small number of points and want to
add more points to it to increase the discriminatory
power, but do not want to include points
with marginal discriminatory power. One by
one we examine the points not yet in the template
by adding only that point to the template
and calculating the output variance. The point
yielding the smallest variance is chosen to be
part of the template. This process is repeated
until the output variance drops below a specified
level, indicating enough discriminatory power
has been achieved.

4 Experiment

Experiments using synthetic data were conducted
to test the relationship between the variances
of input and output perturbation obtained
in the previous section.

4.1 Observable, template, and test
statistics

We chose a cubic polynomial curve as the ideal
observable, a segment of which is chosen as the
template.

$$s^*(x) = 0.05x^3 + 0.04x^2 - 4x + 2 \quad (-10 \le x \le 10)$$

$$h(x) = s^*(x) \quad (-2 \le x \le 2)$$

i.e., $\theta_0 = 0$. We sampled them on evenly
spaced sampling locations with sampling interval
0.1. The resulting observable has 201 points
and the template has 41 points. By plugging the
equation of the sampled template into Equation 10,
we obtained the relationship between
the output and input variances as
$\mathrm{Cov}(\Delta\theta) \approx (1.69 \times 10^{-3}) \times \sigma^2$, where $\sigma^2$ is the variance of
each element of the input noise vector $\nu$.

For a fixed level of input noise, specified by
the variance $\sigma^2$, a 201-dimensional iid Gaussian
noise vector $\nu$ was generated and added to the
ideal observable $S^*$ to simulate the perturbed
observable $S$. Then an exhaustive search was
performed to find the location $\hat{\theta}$ that minimizes
the term in Equation 4. The error $\Delta\theta = \hat{\theta} - \theta_0$
was recorded.


Figure 2: Ideal observable and template used
in the experiment. Dashed curve
is the ideal observable, and the 'x'
marked part is the template. This
continuous signal was evenly sampled
with an interval of 0.1.

Since the exhaustive search was done on a discrete
set of locations, it had a finite spatial resolution.
Let the sampling interval be $q$. In order
for the quantization error to be small, the input
noise level $\sigma^2$ should be in the range such that
$q$ is much less than $\sigma_{out}$, the standard deviation
of the output distribution. It is desirable that $q$
is in the range of $\frac{1}{16}\sigma_{out}$ to $\frac{1}{8}\sigma_{out}$. We held $q$
fixed in the experiments to be 0.1. Using the
relationship $\sigma_{out}^2 \approx 0.00169\sigma^2$, the desired range
of $q$ requires $\sigma^2$ to be in the range of approximately
400 to 1600. This noise perturbation
is too large relative to our signal, and our assumption
of small noise perturbation would be
badly violated. So we have to sacrifice some
spatial resolution in the experiment and take
this quantization into account in interpreting
and using the statistics gathered from the experiments.
We take this into account by calculating
the expected variance using Equation 10,
setting this variance as the variance parameter
of a continuous Gaussian distribution, and then
quantizing this distribution by our sampling interval.
This produces a discrete probability distribution
with an underlying continuous Gaussian
distribution. The variance of this discrete
probability distribution is used as the refined
theoretical prediction of the output variance.
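
The refinement just described can be sketched as follows. The sketch assumes that quantizing means rounding to the nearest multiple of the sampling interval, which the text does not state explicitly, and it truncates the tails at ten standard deviations; the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def quantized_variance(variance, step):
    """Variance of a zero-mean Gaussian after rounding to multiples of `step`."""
    sd = math.sqrt(variance)
    k_max = int(math.ceil(10.0 * sd / step)) + 1  # tails beyond this are negligible
    total = 0.0
    for k in range(1, k_max + 1):
        # Probability mass that rounds to +k*step; the -k bin is symmetric.
        upper = normal_cdf((k + 0.5) * step / sd)
        lower = normal_cdf((k - 0.5) * step / sd)
        total += 2.0 * (upper - lower) * (k * step) ** 2
    return total
```

Because the distribution is symmetric about zero, its mean is zero and the variance equals the second moment. When the step is small relative to the standard deviation, the result exceeds the continuous variance by roughly `step**2 / 12`.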

The noise generation and exhaustive search
were repeated 20,000 times and the sample
distribution of $\Delta\theta$ was obtained. Under the
hypothesis $H_0$, this sample distribution has
mean 0 and variance $0.00169\sigma^2$.
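
The coefficient 0.00169 can be checked with a short calculation. The sketch below assumes that Equation 10 amounts to the input variance divided by the sum of squared template slopes, and it substitutes the analytic derivative of the cubic for the spline derivative; both are simplifying assumptions, and the function name is hypothetical.

```python
def predicted_coefficient(start=-2.0, stop=2.0, step=0.1):
    """Return c such that Cov(delta_theta) is approximately c * sigma^2."""
    count = int(round((stop - start) / step)) + 1  # 41 template samples
    energy = 0.0
    for k in range(count):
        x = start + k * step
        # Derivative of s*(x) = 0.05 x^3 + 0.04 x^2 - 4 x + 2.
        slope = 0.15 * x * x + 0.08 * x - 4.0
        energy += slope * slope
    return 1.0 / energy
```

Under these assumptions the result is about 1.69e-3, in line with the relationship quoted in the text.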

Figure 3: Plot of $T_1$, the normalized sample
mean. $T_1 \sim N(0, 1)$ with mean 0
(plotted in dashed line) and standard
deviation 1 (plotted in dotted
line around the mean).

With Gaussian random variables, the sufficient
statistics are $\overline{\Delta\theta}$ and $S^2_{\Delta\theta}$, namely the sample
mean and sample variance of $\Delta\theta$. Normalizing
them with the predicted output variance, we obtain
the statistics

$$T_1 = \frac{\overline{\Delta\theta}}{\sqrt{\mathrm{Cov}(\Delta\theta)/L}}$$

$$T_2 = \frac{(L-1)\,S^2_{\Delta\theta} + L\,\overline{\Delta\theta}^{\,2}}{\mathrm{Cov}(\Delta\theta)}$$

where $L = 20{,}000$ is the number of samples
observed in the experiment. Under $H_0$, $T_1 \sim N(0, 1)$
and $T_2 \sim \chi^2_L$. Their distributions are
unknown when $H_0$ is not true.
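
The two normalized statistics can be computed from the recorded errors as sketched below. The sketch assumes that $T_1$ is the sample mean divided by its predicted standard deviation and that $T_2$ is the sum of squared errors divided by the predicted variance; the function name is hypothetical.

```python
import math


def normalized_statistics(errors, predicted_var):
    """Return (T1, T2) for a list of location errors and a predicted variance."""
    count = len(errors)
    mean = sum(errors) / count
    sample_var = sum((e - mean) ** 2 for e in errors) / (count - 1)
    # T1: sample mean scaled by the predicted standard deviation of the mean.
    t1 = mean / math.sqrt(predicted_var / count)
    # T2: (L-1) S^2 + L mean^2 equals the sum of squared errors.
    t2 = ((count - 1) * sample_var + count * mean ** 2) / predicted_var
    return t1, t2
```

For the errors 0.1, -0.1, 0.0 and 0.2 with a predicted variance of 0.01, this gives $T_1 = 1$ and $T_2 = 6$.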

A series of experiments was conducted for different
levels of input noise where samples of $\Delta\theta$
were gathered and the statistics calculated. The
normalized sample means are plotted in Figure 3.
Figure 4 shows one example of the observed
sample distribution, predicted discrete
distribution, and the underlying predicted continuous
Gaussian distribution for input noise
variance $\sigma^2 = 2.78$. In this particular case, the
sample values of $\Delta\theta$ are actually quantized into
8  bins.  The  sample  mean  is  6.55x10"^,  the  sam¬ 
ple  variance  is  7.42  x  10""®,  the  predicted  mean 
is 0, and the predicted variance is $4.70 \times 10^{-3}$
for the underlying continuous Gaussian distribution
and $5.50 \times 10^{-3}$ for the quantized
distribution.


4.2  Summary  of  the  observed  data 

Figure 3 shows the agreement on the mean value
of the output perturbation between the experimental
observation and the theoretical prediction
in that the variation of the sample mean
is within six times the standard deviation of
the predicted distribution. For input noise variances
that are not too large, the quantization error
is not severe, and we observe agreement between
the theoretically predicted variance and
the observed sample variance.

Figure 4: Plots of the sample distribution
(solid line), the predicted (discrete)
distribution (dashed line), and the
underlying continuous distribution
for input variance $\sigma^2 = 2.78$.

5  Conclusion 

In  this  paper,  we  applied  the  theory  of  covari¬ 
ance  propagation  to  template  matching  and  ob¬ 
tained  the  variance  of  the  output  of  the  tem¬ 
plate  matching  algorithm  as  a  function  of  the 
covariance of the input perturbation and the
template  itself.  It  is  a  measure  of  the  discrimi¬ 
natory  power  of  the  template,  and  can  be  used 
to  guide  the  construction  of  templates  in  or¬ 
der  to  increase  the  invariance  of  the  templates 
and potentially increase the robustness of template
matching algorithms to perturbations that
do not closely follow the proposed stochastic
model. Experiments on synthetic data agreed
with the theoretical prediction. We are now
conducting experiments on real data to test the
performance of the methodology.

Abstract

We present an approach to performance
prediction of a geometric-based automatic target
detection (ATD) system.* Performance analyses
of both the probability of a false match to
background and the probability of a match
(detection) to the true target are presented. The
predictive  models  lead  to  a  means  by  which  to 
automatically  trade-off  the  expected  probability 
of  detection  (Pd)  and  false  alarm  rate  (FAR)  for 
a  given  scene  or  part  of  a  scene,  thereby  allowing 
a  user-specified  set  of  overall  system 
performance  specifications  to  drive  the  system 
parameters.  Results  using  an  adaptive  algorithm 
are  presented.  It  is  seen  in  initial  experimental 
results  that  the  adaptive  approach  is  able  to 
significantly  reduce  the  FAR  at  the  expense  of  a 
relatively  minor  reduction  in  the  Pd. 

1  Introduction 

An analysis of the matching method of fixing a
maximum model-to-data local match tolerance
and measuring the observed fraction of model
matched to data is presented. To arrive at a
tractable result, we have adopted a number of
simplified models, including those for
atmospheric attenuation, targets, correlation
between target models, edge operators, sensor
blur, sampling, noise, and background. We
believe these approximations to be adequate for
the present analysis, with a goal of functioning as
a usable predictive system over a broad range of
scenarios, targets, and sensors. Section 1 gives
an overview of the analysis approach, which is
followed in Sections 2 and 3 by the FAR and Pd
analyses, respectively. In Section 4 we present
results of initial experiments using an adaptive
ATD system.

* This work was sponsored by DARPA under Army
Research Office contract TDAAH04-93-C-0052, and is
currently sponsored under Air Force contract F33615-97-C-1022.

The false alarm analysis describes the
probability of a false match within a local search
area as a function of the local clutter edge density
and correlation, search area, model set, and
algorithm search and match parameters. The Pd
is a function of the target contrast, size, range,
the atmospheric attenuation, the sensor optical
and detector point spread functions, the sensor
sensitivity, the properties of the edge detector,
and the fractional visibility of the target. Both
the FAR and Pd performance models are
functions of the fraction $f_0$ of the model
matched to the data above which a detection is
declared. This allows the FAR and Pd models to
be expressed as simultaneous functions of $f_0$,
and a predicted receiver operating characteristic
(ROC) curve to be calculated. The ROC predictive
capability enables trade-offs between the Pd and
FAR to be performed by an adaptive ATD
algorithm.

The  ATD  system  discussed  in  this  paper  is  based 
on  the  matching  of  geometric  features  of  a  model 
to  those  of  extracted  data.  As  such,  the 
probability  of  extracting  the  true  target  versus 
background features, and the local tolerances of
the model-to-data feature matches, tend to control
the  system  Pd  and  FAR.  Thus  far  we  have  only 
analyzed  the  use  of  target  edge  features.  In  the 
future  we  will  extend  this  type  of  analysis  to 
include  other  types  of  geometric  information,  and 
to  optimize  feature  extraction  with  respect  to  the 
expected  Pd  and  FAR  [Doria  1997]. 

2  False  Alarm  Rate  Prediction 

We present an analysis of an operating mode in
which the maximum distance of each local
model-to-data edge match is fixed, and the
fraction of the model matched to the data is
measured. Grimson and Huttenlocher [1994]
present an analysis of the probability of false
match  for  a  generalized  Hausdorff  based 
matching  algorithm  over  translation.  They  show 
expressions  for  the  probability  that  a  given  model 
point  will  overlap  the  area  within  a  fixed  distance 
of  any  data  point  for  non-correspondence  based 
Hausdorff  matching  methods,  and  calculate 
subsequent  false  match  probabilities.  In  [Doria 
1996]  the  false  match  prediction  model  was 
extended  to  include  the  effects  of  correlated 
clutter  edges,  local  model-to-data  matching 
window  size,  correlation  over  translation,  and 
search  with  multiple  correlated  models. 

2.1  False  Alarm  Rate  Calculation 

In  many  ATR  systems,  areas  of  interest  (AOIs) 
are  obtained  from  an  initial  screening  or  focus  of 
attention  mechanism.  These  then  restrict  the  pose 
search  to  a  number  of  sub-areas  of  the  image.  If 
within  any  of  these  AOI's  at  least  one  model  is 
accepted  as  a  match  when  no  target  is  present, 
we  label  this  as  a  false  alarm  for  that  AOI.  Only 
a  single  target  report  is  allowed  per  AOI,  so  that 
multiple  matches  of  different  models  to  different 
locations  within  the  AOI  are  not  counted  twice. 
The  overall  expected  false  alarm  rate  per  image 
is  then  the  sum  of  the  mean  false  alarm 
probability  at  each  AOI: 

$$FAR_{image} = \sum_{i=1}^{N_{AOI}} P_{FA}(n_i)$$

where $n_i$ denotes a particular AOI, $N_{AOI}$ is the
number of AOI's, and $P_{FA}(n_i)$ is the probability
of a false alarm in AOI $n_i$. Note that because
instances of false matches are not counted more
than once for each AOI, the value of $P_{FA}(n_i)$
(henceforward called simply $P_{FA}$) is bounded
between 0 and 1. Using this terminology, we
address the prediction of the false alarm
probability $P_{FA}$ under translational search for
multiple models as a function of important
parameters affecting the probability of a false
match for a "maximum fraction at a fixed
distance" Hausdorff based approach. The false
alarm rate prediction is based on the combination
of search space, number of models, complexity of
the non-target regions, sensor characteristics, and
range to the AOI's.

The FAR performance model addresses the
performance of the system operating on
geometric matches to edges. For this analysis we
model the background as a zero mean noise
process and analyze the performance of a simple
edge operator of the form

$$g_h(i, j) = I(i, j-1) - I(i, j+1)$$

$$g_v(i, j) = I(i-1, j) - I(i+1, j)$$

with gradient magnitude

$$g(i, j) = \sqrt{g_h^2(i, j) + g_v^2(i, j)}$$

Other more sophisticated edge operators can be
analyzed and inserted into the model as the
user desires; we have chosen a relatively simple
representative edge operator for the initial
analysis.
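
A minimal Python sketch of such an operator follows. It assumes the operator is a central difference over a two-pixel offset in each direction and leaves border pixels at zero; the image is a plain list of rows and the function name is hypothetical.

```python
import math


def gradient_magnitude(image):
    """Central-difference gradient magnitude at interior pixels; borders stay 0."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            g_h = image[i][j - 1] - image[i][j + 1]  # horizontal difference
            g_v = image[i - 1][j] - image[i + 1][j]  # vertical difference
            out[i][j] = math.hypot(g_h, g_v)
    return out
```

Each difference spans pixels two apart, which is why the covariance at an offset of two pixels enters the variance of the operator output.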


The system operates independently in each AOI.
We first estimate the expected density of
background (or clutter) edges. Given zero mean
clutter of variance $\sigma_c^2$ and covariance of $\sigma_{c,2}$
at an offset of two pixels, the expected values of
$g_h$ and $g_v$ are 0, and the variances of $g_h$ and
$g_v$ are $\sigma_g^2 = 2\sigma_c^2 - 2\sigma_{c,2}$. The operator $g$ then
has a Rayleigh probability density over the
background:

$$p(g(i, j)) = \frac{g(i, j)}{\sigma_g^2} \exp\left(-\frac{g^2(i, j)}{2\sigma_g^2}\right)$$

To relate the $P_{FA}$ to the actual background grey
level statistics, we use an adaptively selected
local edge detector threshold for the false alarm
and detection performance models and within the
algorithm. Given the above model of an edge
operator^, the probability of detecting a false
edge in the background is

$$p_c = \exp\left(-\frac{T^2}{2\sigma_g^2}\right)$$

where $T$ is the edge operator threshold.
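
The relation between the threshold and the false edge probability can be sketched as below. The sketch assumes the gradient magnitude is Rayleigh distributed, so the probability of exceeding a threshold is the Rayleigh tail probability, and that the gradient variance is twice the clutter variance minus twice the two-pixel covariance. All function and parameter names are hypothetical.

```python
import math


def gradient_variance(clutter_var, clutter_cov2):
    """Variance of a two-pixel-offset difference of zero-mean clutter."""
    return 2.0 * clutter_var - 2.0 * clutter_cov2


def false_edge_probability(threshold, clutter_var, clutter_cov2):
    """Rayleigh tail probability that the gradient magnitude exceeds the threshold."""
    grad_var = gradient_variance(clutter_var, clutter_cov2)
    return math.exp(-threshold ** 2 / (2.0 * grad_var))


def threshold_for_probability(target_prob, clutter_var, clutter_cov2):
    """Invert the relation: threshold that yields a desired false edge probability."""
    grad_var = gradient_variance(clutter_var, clutter_cov2)
    return math.sqrt(-2.0 * grad_var * math.log(target_prob))
```

Inverting the relation is one way an adaptively selected threshold could be chosen from measured clutter statistics, although the text does not say how the selection is actually done.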

As  a  first  order  model  of  the  correlation  between 
background  edges,  we  adopt  an  exponential 
function of distance for horizontal and vertical
edges: 


transform,  as  described  in  [Doria  and 
Huttenlocher  1996]. 

The  probability  of  match  of  a  single  model  point 
to noise data is a function of the density of noise
(background  clutter  and  noise)  edges,  the 
covariance  of  the  background  edges,  and  the  size 
of  the  tolerance  window  surrounding  each  model 
point.  First  we  will  find  the  probability  that  a 
single  model  point  m  is  matched  to  a  data  point. 
For  a  1 X  1  tolerance  window,  the  expected  value 
of  the  sum  of  window  points  surrounding  the 
model  edge  point  m  is 

and  for  a  3x3  window  the  expected  value  of 
the  sum  is 


These, along with the clutter edge density $p_c$, can
be estimated from the detected clutter edges.

2.2  Matches  at  a  Single  Model  Point 

The  Hausdorff  based  matching  approach  that  we 
are  studying  operates  by  matching  model  features 
(in  this  case  edges)  to  observed  data.  A  distance 
hj  is  used  to  allow  or  disallow  model-to-data 
edge  matches.  If  at  least  one  data  edge  is  within 
the  distance  to  a  given  model  edge,  a  match  is 
counted  for  the  model  edge.  Use  of  a  specified 
tolerance  distance  translates  into  the  specification 
of  a  local  window  around  each  model  edge  point 
within  which  data  edges  can  occur  for  an  allowed 

match.  For  h.=—  there  is  a  corresponding 
2 

1  X  Iwindow,  and  for  =  V2  there  is  a  3x3 
local  window.  We  will  focus  on  the  performance 
of  both  1x1  and  3x3  local  windows. 
Alternatively,  we  can  use  an  inverted  distance 


^  Note  that  other  operators  or  feature  types  can  be 
used;  analysis  of  their  performance  in  the  presence  of 
target  and  clutter  would  then  be  used  instead  of 
edges  within  this  general  modeling  method. 
Application  to  other  sensor  types  is  also  feasible. 


where $p_c$ is the estimated density of background
noise or clutter edges. The variance of the sum
over the window points is, when the pixels are
independent,

$$\sigma^2 = N p_c (1 - p_c)$$

where N is the number of pixels in the local
window.
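
The independent-pixel case can be checked numerically. The following is a minimal sketch, assuming each of the N window pixels is an independent Bernoulli edge with density $p_c$; the function names and the example density of 0.1 are illustrative and do not come from the paper.

```python
def window_sum_stats(p_c: float, n_pixels: int):
    """Mean and variance of the edge count in a window of n_pixels,
    assuming independent Bernoulli(p_c) pixels (an assumption)."""
    if not 0.0 <= p_c <= 1.0:
        raise ValueError("p_c must be a probability")
    mean = n_pixels * p_c
    variance = n_pixels * p_c * (1.0 - p_c)
    return mean, variance


def match_probability(p_c: float, n_pixels: int) -> float:
    """Probability that at least one of n_pixels independent pixels is an edge."""
    return 1.0 - (1.0 - p_c) ** n_pixels


# Hypothetical clutter density of 0.1 edges per pixel.
mean_1x1, var_1x1 = window_sum_stats(0.1, 1)   # 1 x 1 window
mean_3x3, var_3x3 = window_sum_stats(0.1, 9)   # 3 x 3 window
p_match_3x3 = match_probability(0.1, 9)
```

Widening the window from 1 × 1 to 3 × 3 raises both the expected edge count and the chance of a spurious match, which is why the tolerance window size enters the false alarm model directly.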

If  the  edge  pixels  in  the  window  are  not 
independent,  then  the  variance  of  the  sum  is 

M=1  v=l  m'  =  1v'  =  1 

where 

ct(m,v,m',v')  =  -  a) 

When the edge pixels are independent, the
probability of a match of a model point to at
least one background data edge is

$$p_B = 1 - (1 - p_c)^N.$$

In the case of correlated edges in a 3 × 3 (or
larger) window, we can calculate the binomial
statistics directly. Alternatively, we can estimate
$p_B$ using a Gaussian approximation for the sum:


The statistic $S_M$ is the sum of Bernoulli random
variables, and can be approximated by a
Gaussian distribution $N(u_B, \sigma_B^2)$. The
expected value of the sum for the background
pixels is $u_B = M p_B$. The variance of $S_M$ is
then the sum of the covariances of the model-to-
data match windows:


where $\Phi$ is the standard Gaussian distribution
function. The approximation to a Gaussian
density is not extremely accurate with only nine
terms, especially for low values of the mean. As an
alternative, we can estimate the equivalent
number of “independent” pixels as


N'  = 


A'V.O-P.) 

1 


and  use  a  Poisson  model: 


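For the model match sum $S_M$, the variance is

$$\sigma_B^2 = \sum_{i=1}^{M} \sum_{j=1}^{M} \operatorname{cov}(S_i, S_j)$$

where $(i,j)$ indicate model edge points. When the
model-to-data matches are independent, the
variance becomes the familiar binomial variance:

$$\sigma_B^2 = M p_B (1 - p_B).$$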


t'e  " 

Once  we  have  estimated  the  probability  of  a 
match  to  a  single  model  edge  feature,  we  obtain 
the  probability  of  a  match  to  a  fraction  of  the 
edge  pixels  of  the  entire  model,  at  a  single  model 
location  in  clutter.  The  match  of  a  model  to  data 
is  based  on  the  sum  of  model  points  that  have  a 
matching  data  point  within  a  distance  of  the 
model  point. 


It is often the case that the local model-to-data
match windows are not independent. This may
be due to overlapping match windows around the
model edge points, or statistical correlation
between the data edges. For these cases we
estimate the value of $\sigma_B^2$ from the above sum of
covariances. Let $\rho_{m,m+k}$ be the correlation
between the random variables $S_m$ and $S_{m+k}$.
We use

(j\-Rs  -  Wm 


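Let $s_m = \sum_{n=1}^{N} e_{m,n}$, where the $e_{m,n}$ values are the
binary edge values within the window
surrounding model point m. We define a new
binary random variable as:

$$S_m = \begin{cases} 1 & \text{if } s_m \ge 1 \\ 0 & \text{otherwise.} \end{cases}$$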

The sum of model matches over the entire model
is then

$$S_M = \sum_{m=1}^{M} S_m$$

When $S_M$ is greater than or equal to a threshold
$f_0 M$, a detection is declared.


where  for  binary  random  variables 

=  'I'*"  = 

=  =  Oft- 

Letting  »,  =  £(s,|s.  a  1),  and  ie  the 

correlation  between  and  due  to  both 

statistical  clutter  correlation  and  geometric 
window  effects,  the  conditional  expected  value 

^  1)  *  Pn,.m.k  • 

=  a 

^m,m+K

The conditional variance can also be estimated
for each (i,j) combination of m and m+k. From
this, the variance $\sigma_B^2$ can be obtained.

A detection is determined by the number of model
edges $S_M$ that are matched to at least one data
edge, as described above. The observation $S_M$
is a sample from a binomial distribution, which is
well modeled as a Gaussian distribution when
M is large enough and $u_B$ is not too near 0 or
M. A detection is declared if $S_M \ge f_0 M$, and
the corresponding probability of a false match
$P_{fa}(f_0)$ at a single location of the model is
approximated by


and $u_B$ is the mean vector. The covariance
matrix C is a function of the model size and
shape, and the local model-to-data tolerance
window.

For this analysis, we represent models as
rectangles of dimensions $L_x \times L_y$. As an
element of the estimation of the FAR, we need to
estimate the correlation between the model-to-
data scores as a function of model horizontal and
vertical translations $\Delta x$ and $\Delta y$. This requires
an estimate of the correlation of the model itself
as a function of translation. We approximate the
model correlations $\rho_{M,x}$ and $\rho_{M,y}$ as


dig 

Pm,x  * 

PM,y 

.  4  -  Ax 

(Lx  +  Ly)  —  2 

-  Av 

-0.5 -Ur, 

where  “  1^  - - - 

and 

«•'  0\  Zaa  ) 

(Lx+Ly)-2 

dig  - 


f^M-Ug- 0.5 


CTr 


A  single  correlation  value  is  obtained  from 


PM^xy  y  Pm  ,xPm  ,y 


where $f_0$ has a value of $a/M$, $a \in \{0, 1, \ldots, M\}$.
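
The Gaussian approximation to the single-location false match probability can be sketched numerically. The block below is a minimal sketch, assuming independent model-to-data matches (so that $\sigma_B^2 = M p_B (1 - p_B)$) and using only the upper tail of the normal distribution with a 0.5 continuity correction. That is the standard textbook approximation and is not necessarily the exact expression used in the paper. The function names and example values are hypothetical.

```python
import math


def false_match_prob(f0: float, m: int, p_b: float) -> float:
    """Normal approximation to P(S_M >= f0 * M) for S_M ~ Binomial(M, p_b).

    Assumes independent matches; uses a 0.5 continuity correction.
    """
    u_b = m * p_b                                  # mean of the match sum
    sigma_b = math.sqrt(m * p_b * (1.0 - p_b))     # binomial standard deviation
    z = (f0 * m - 0.5 - u_b) / sigma_b
    return 0.5 * math.erfc(z / math.sqrt(2.0))     # upper Gaussian tail


def exact_binomial_tail(f0: float, m: int, p_b: float) -> float:
    """Exact P(S_M >= f0 * M), for comparison with the approximation."""
    k_min = math.ceil(f0 * m - 1e-9)               # guard against float round-off
    return sum(math.comb(m, k) * p_b ** k * (1.0 - p_b) ** (m - k)
               for k in range(k_min, m + 1))


# Hypothetical example: 100 model points, background match probability 0.3.
approx = false_match_prob(0.4, 100, 0.3)
exact = exact_binomial_tail(0.4, 100, 0.3)
```

Comparing `approx` with `exact` for a given M gives a quick sense of when the Gaussian approximation is adequate, in line with the remark that M must be large enough and the mean not too near 0 or M.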

2.3  Search  Over  an  Area  of  Interest 

As discussed earlier, we typically search each
AOI to determine if a target exists within that
area. For a single model, we now look at the
probability of a false match within a search area
of size $A = A_x \times A_y$. The probability density of
the match values is expressed by the joint
Gaussian density of the model matches as a
function of translation over the search area A:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{A/2} |C|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x} - \mathbf{u}_B)^T C^{-1} (\mathbf{x} - \mathbf{u}_B)\right)$$

where C is the covariance matrix between the
model-to-data matches over translation, $\mathbf{x}$ is the
vector of observed match scores over the area A,


^ See [Doria 1996] for details.


2.4  Probability  of  One  or  More  False 
Matches  per  Area 

We are interested in the probability that at least
one of the matches over translation exceeds a
fraction $f_0$ matched to the data. To find this we
express the matches at each location as a
product of conditional probabilities, and then
estimate the probability that at least one of the
matches exceeds a threshold. Denoting the model
match results at locations 1, 2, ..., A as
$\mathbf{x} = [x_1, x_2, \ldots, x_A]$, the probability of false alarm
over the area A, at a match fraction threshold of $f_0$, is:

$$P_{FA}(A, f_0) = 1 - P(x_1 < M f_0)\, P(x_2 < M f_0 \mid x_1 < M f_0) \cdots P(x_A < M f_0 \mid x_{A-1} < M f_0, x_{A-2} < M f_0, \ldots)$$

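The conditional mean of a Gaussian random
variable $x_2$, given observed variables $x_1$, is

$$\bar{x}_{2|1} = \bar{x}_2 + C_{21} C_{11}^{-1} (x_1 - \bar{x}_1)$$

where the covariance matrix has been partitioned
as:

$$C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}$$

and conditional variance is

$$C_{2|1} = C_{22} - C_{21} C_{11}^{-1} C_{12}.$$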
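
The standard conditioning formulas for jointly Gaussian variables can be illustrated in the scalar case. This is a minimal sketch with a hypothetical function name and example values; for scalars, $C_{21} = C_{12}$.

```python
def conditional_gaussian(mean1: float, mean2: float,
                         c11: float, c12: float, c22: float,
                         x1: float):
    """Conditional mean and variance of x2 given x1 for a bivariate Gaussian."""
    gain = c12 / c11                           # C21 * C11^{-1} in the scalar case
    cond_mean = mean2 + gain * (x1 - mean1)
    cond_var = c22 - gain * c12                # C22 - C21 * C11^{-1} * C12
    return cond_mean, cond_var


# Hypothetical example: unit variances, correlation 0.5, observed x1 = 2.0.
cond_mean, cond_var = conditional_gaussian(0.0, 0.0, 1.0, 0.5, 1.0, 2.0)
```

With unit variances and correlation $\rho$, the conditional variance reduces to $1 - \rho^2$, so stronger correlation between adjacent match scores shrinks the conditional spread.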

Since we expect that the conditional mean, for
meaningfully low false alarm rates, will be
approximately equal to the unconditional mean,
we let $\bar{x}_{2|1} = u_B$. If we model the search between
rows as independent, and treat correlation as a
function of model translation along a row as
Markov, then the conditional variance is

The expression for $P_{FA}(A, f_0)$ becomes:


'diA 

A, 

f  ‘12'a  -P  ) 

1  e  2 

j  e  ^  dz 

(2^)J 

Uu  ) 

Uiu  J 

where 


-0.5 -M 


B 


dl\ 


f,M-u,-05  __  ,,,  _  -0.5-Us 


and  dl[ 


Cf'r, 


2.5  Search  With  Multiple  Models 

Figure 1. Probability of false alarm as a
function of edge density, for both predicted
and Monte-Carlo experimental results; 5 × 5
search area, single 6m × 3m model, 1 × 1
Hausdorff matching window, $f_0$ = .75, R = 6km.

In most real ATR problems the system must
search an area for more than a single target type
(or type, view combination). In such instances the
false alarm probability within a given area is a
function of the number of models and their
correlation. In this section, we approximate the
statistical correlation of model-to-data matches
between models, and modify the expression for
$P_{FA}$ accordingly. To precisely model the $P_{FA}$
performance requires that the covariance matrix
between the model types (or types × views) be
known. For a general prediction capability,
however, a model of approximate behavior for a
general set of models is more useful. To achieve
this, we study the case of a set of models of equal
size and equal correlation, so that the covariance
matrix between the models is


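$$\begin{bmatrix} 1 & \rho_M & \cdots & \rho_M \\ \rho_M & 1 & \cdots & \vdots \\ \vdots & \cdots & \ddots & \rho_M \\ \rho_M & \cdots & \rho_M & 1 \end{bmatrix}$$

where all non-diagonal elements of the
correlation matrix are $\rho_M = \bar{\rho}_M$, where $\bar{\rho}_M$ is
the mean between-model normalized cross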
correlation.'^ 


■'M 


O’! 


''Clearly  this  is  not  descriptive  of  most  actual  model 
sets,  but  is  a  reasonable  first  order  approximation. 
Other model set correlations can also be adopted if
desired, and actual model-set correlations can be
used if known.

We treat the search over multiple models as an
independent search dimension at each
translational location t. We now have, for $N_T$
models, a joint distribution of matches of $N_T$
models at model location t. Expressing the
probability of a false match as a function of
conditional probabilities, we have


$$P_{FA}(f_0, M, N_T, t) = 1 - P(x_{1,t} < f_0 M) \cdot P(x_{2,t} < f_0 M \mid x_{1,t} < f_0 M) \cdots P(x_{N_T,t} < f_0 M \mid x_{1,t} < f_0 M, \ldots, x_{N_T-1,t} < f_0 M)$$


PFAi-^x’^y’fo’^’^T’PM)~  ^ 


f  foM-Ug-.S 


1 

Nf 

n 

AplyNr 

{iTt) 

2 

m=l 

K 

f 

foM-Ug 

-.5 

N'f 

<^B,m 

n 

I 

e 

m=l 

-■5-ub 

V 

<^B,m 

-■5-ub 

^B,m  J 

\Ay(A„-\) 


J 


By  assuming  joint  Gaussian  statistics  for  the 
above  conditional  probability  densities  for  the  set 
of  models  at  location  t,  we  can  calculate 
conditional  means  and  variances,  and  express  the 
above  equation  as  a  product  of  conditional 
probabilities  of  each  successive  model.  We  again 
set  the  conditional  mean  to  the  value  of  the 
unconditional  mean,  and  the  conditional  variance 
of  model  m  to 


$$C_{2|1,m} = C_{22,m} - C_{21,m} C_{11,m}^{-1} C_{12,m}$$


For  the  above  covariance  matrix  this  gives 
conditional  variances  of  either 


^  -  ^)P^M 

+1. 

K  {m-  2)/7„  +  h 


Figure 2. Probability of false alarm $P_{FA}$ as a
function of model match threshold $f_0$ for 10
equally correlated models ($\rho_M$ = .75) of 18m
contour, for both predicted and Monte-Carlo
experimental results; 5 × 5 search area, 3 × 3
matching window, edge density = .1, R = 3km.

3  Probability  of  Detection  Modeling 


We now have an expression for the probability of
false match for translation over a search area A,
accounting for correlated background clutter,
correlated local match windows, correlation
between models, and correlation of models as a
function of translational location:


3.1  Pd  Analysis 

In this section we model the performance of the
Hausdorff-based geometric matching algorithm
on true targets, again using rectangular target
models of dimensions $L_x \times L_y$. Let the
temperature difference between the target and
background be $\Delta T$ at range R. We model the
atmospheric attenuation as exponential to obtain
an effective temperature difference
$\Delta T_R = \Delta T\, e^{-\alpha R}$ at the sensor, where $\alpha$ is the
atmospheric attenuation coefficient. We adopt a first
order model of a FLIR sensor as a linear system
blur followed by sampling plus additive noise.
The image irradiance function at the focal plane
is then

$$H(i,j) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} d(x,y)\, h_{optics}(i - x, j - y)\, dx\, dy$$


P  = - —  in  V ,  where  Pi,  and  P^ 

^FOV  p  "samp  ^samp 

^  V 

^samp 

are the respective number of pixel samples per
detector instantaneous fields of view in the h
and v directions. The expected value of the
output of the edge operator, with a vertical edge
centered at location (i,j), is calculated from the
expected edge operator h and v responses:


where $d(x,y)$ is the image irradiance at the
input to the sensor. The irradiance function is
then convolved with the detector sampling
function and noise of variance $\sigma_n^2$ added:

$$I(i,j) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} H(x,y)\, h_{det}(i - x, j - y)\, dx\, dy + w(i,j)$$


Normalizing for both the noise equivalent
temperature difference (NEΔT) and the edge
detector difference operation gives a unit
variance in both $g_h(i,j)$ and $g_v(i,j)$ when
operating on uniform areas of the target or
background:

i(ijy= 

yflNEAT 

and 


""  yflNEAT 


Given a target edge of effective thermal contrast
$\Delta T_R$, we wish to estimate the output of the edge
operator discussed previously. To do this, we
approximate $h_{optics}$ as resulting from a circular
aperture with diffraction limited blur, using the
mean wavelength $\lambda$ observed by the sensor, and
the effective sensor optical diameter D.
Alternatively, we can use a measured or
estimated point spread function. We
approximate the detector sampling blur kernel
$h_{det}$ as a two dimensional Rect function of size
$IFOV_h \times IFOV_v$, and the pixel dimensions of


size 


f’FOV 


IFOV, 


in  h  and 


^  A, 


E(gi,{i,j)) «  ATj^ 


'FOV  2 

j  H(i  +  uJ)du 

IFOVf, 


IFOVf, 


V  ^f'FOV  2 


'FOV  2 

j  H(i  +  u,j)du 

IFOVf, 


<FOV  2 


Sv 

and 

gh=E(gf,(i,j))  =  0 


Letting $z = \sqrt{g_h^2 + g_v^2}$ (dropping the indices
(i,j)), the density function of the edge operator at
the center of either a vertical or horizontal edge
transition is now Rician*. Using the normalized
values of the thermal contrast, we have unit
variance in $g_h$ and $g_v$, and a distribution of the
edge values of

$$p(z) = z\, e^{-\frac{z^2 + \bar{g}_v^2}{2}} I_0(z \bar{g}_v)$$

The  probability  of  detection  of  a  single  target  edge  is 
then 


00  Z^+gy 

Pt,  ^  hi^Sv) 

which is Marcum's Q-function.
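
The tail of the Rician density can be evaluated numerically without special-function libraries. The sketch below is illustrative only: it assumes a unit-variance Rician density with mean response `g`, and integrates it from an edge detection threshold to a large upper limit with Simpson's rule. The threshold argument, function names, and step count are assumptions, since the paper's threshold symbol is not legible in this copy.

```python
import math


def bessel_i0(x: float) -> float:
    """Modified Bessel function of the first kind, order 0, via its power series."""
    term, total, k = 1.0, 1.0, 0
    while term > 1e-16 * total:
        k += 1
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def rician_pdf(z: float, g: float) -> float:
    """Unit-variance Rician density with noncentrality g."""
    return z * math.exp(-0.5 * (z * z + g * g)) * bessel_i0(z * g)


def edge_detection_prob(g: float, threshold: float, steps: int = 2000) -> float:
    """Probability that the edge magnitude exceeds `threshold`
    (Marcum Q-function of order 1), by composite Simpson integration.
    `steps` must be even."""
    upper = max(threshold, g) + 10.0           # density is negligible beyond this
    h = (upper - threshold) / steps
    total = rician_pdf(threshold, g) + rician_pdf(upper, g)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * rician_pdf(threshold + i * h, g)
    return total * h / 3.0


# With no contrast (g = 0) the density is Rayleigh and the tail is exp(-t^2 / 2).
rayleigh_tail = edge_detection_prob(0.0, 2.0)
```

For a fixed threshold, the detection probability grows with the normalized contrast `g`, which is the dependence on effective thermal contrast that the text describes.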


* We have not included “off-edge” true responses.

Given $P_{T_e}$, we wish to estimate the probability
that a given model edge will be matched to an
observed data edge from the target®. For a 1 × 1
matching window, the probability of a match is

$$p_T = P_{T_e}$$

$$P_d(\Delta T_R, f_0, L_T, R, IFOV_h, IFOV_v, h_{optics}, h_d) = 1 - \frac{1}{\sqrt{2\pi}} \int_{\frac{-.5 - u_T}{\sigma_T}}^{\frac{f_0 M - .5 - u_T}{\sigma_T}} e^{-z^2/2}\, dz$$

For a 3 × 3 matching window, the target contour
passes through approximately three edge pixels
in the window; hence the probability of a model-
to-data edge match is

$$p_T \approx 1 - (1 - P_{T_e})^3$$

For  a  target  of  contour  length  Lj.  =  L^L 
meters,  the  number  of  observed  pixels  is: 

f,  . 

^~{Ry(IFOV,)  (RpFov:) 

The distribution of observed target pixels
matched to the correct target model is binomial
with mean

$$u_T = p_T M_T$$

and variance

$$\sigma_T^2 = M_T\, p_T (1 - p_T)$$
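
The target-side statistics can be worked through in a few lines. This is a minimal sketch, assuming independent edge detections and the approximation that a 3 × 3 window contains about three contour pixels; the function name and the example values (edge detection probability .7, 80 model pixels) are hypothetical.

```python
def target_match_stats(p_te: float, m_t: int, window: str = "3x3"):
    """Per-point match probability, and binomial mean and variance of the
    number of matched model points, for a 1x1 or 3x3 tolerance window."""
    if window == "1x1":
        p_t = p_te
    elif window == "3x3":
        # At least one of roughly three contour pixels in the window is detected.
        p_t = 1.0 - (1.0 - p_te) ** 3
    else:
        raise ValueError("window must be '1x1' or '3x3'")
    mean = p_t * m_t
    variance = m_t * p_t * (1.0 - p_t)
    return p_t, mean, variance


# Hypothetical example values.
p1, mean1, var1 = target_match_stats(0.7, 80, "1x1")
p3, mean3, var3 = target_match_stats(0.7, 80, "3x3")
```

The wider window raises the per-point match probability from .7 to about .97, which raises the mean match count and tightens its spread. The same widening also raises the background match probability, so the benefit has to be weighed against the false alarm model.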

For correlated tolerance windows, we also adjust
the variance of the model match sum similarly to
that with the false alarm, arriving at a variance
$\sigma_T^{\prime 2}$. We now estimate the probability of
detection of the target by approximating the
binomial distribution with a Gaussian:


A few examples indicate the predictive use of this
equation^. Figure 3 shows $P_d$ as a function of
the matching fraction $f_0$ and a 1 × 1 model
matching tolerance window. Figure 4 shows the
same experiment with a 3 × 3 tolerance window.


Figure 3. Predicted probability of detection $P_d$ as a
function of detection threshold $f_0$ for a 6m ×
3m target and a 3m × 2m target. $P_{T_e}$ = .7, 1 × 1
tolerance window, range = 3km,
IFOV = 75 µrad.

3.2  Partially  Observable  Target 

It is often the case that a target may be only
partially visible. This may be the result of either
very low thermal contrast over a portion of the
target, partial occlusion by a portion of the
surroundings, or intentional concealment. In
such cases the maximum observable number of
target pixels is $M_T'' = (1 - f_{occ}) M_T$, where the
fractional non-visibility is $f_{occ}$. In such cases
the equation for the predicted $P_d$ becomes


®We  assume  that  the  target  model  is  positioned  at  the 
correct  location  (i.e.  we  have  not  analyzed  the  search 
process  here;  we  assume  that  a  model  is  positioned, 
by  whatever  means,  at  the  correct  location). 


^The  sensor  parameters  used  in  the  examples  in  this 
paper  were  not  derived  from  nor  do  they  represent 
any  actual  or  potential  sensor  parameters.  Any 
similarity  between  these  parameters  and  those  of  an 
actual sensor is coincidental.

Figure 4. Predicted probability of detection
as a function of detection threshold $f_0$ for a 6m
× 3m target and a 3m × 2m target. Same
parameters as in Figure 3 except now the system
is using a 3 × 3 tolerance window for matching
each model point to an edge point.


(ATji  ,fQ,Lj.R.  ,  NEAT.  IFOV/, ,  IFOV^ ,  ,  hj ,  = 


Pd 


Mj.  -  u" -.5 


i  _  -  J 

J  e  2  (fe  +  I 


2  dz 


-.5  - 


Mj.  -u"-.5 


0  otherwise 


To obtain an equivalent probability of detection
in the presence of a partially visible target, the
threshold $f_0$ must be reduced. This increases the
expected FAR, and reduces the separability of
target and clutter.

3.3  ROC  Calculation 

By obtaining predictions of both the $P_d$ and
$P_{FA}$ as a function of $f_0$, the system can
calculate predicted receiver operating
characteristic (ROC) curves. This can be done
for each local area of interest, using the estimated
clutter characteristics from that area. An
example of an ROC curve predicted using the
$P_d$ and $P_{FA}$ models is given in Figure 5.
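
The ROC construction amounts to sweeping the threshold and pairing the two predictions. The following is a simplified sketch, not the paper's full model: it assumes a single model at a single location, independent matches, and the continuity-corrected normal approximation to the binomial for both target and background. All names and example values are hypothetical.

```python
import math


def upper_tail(f0: float, m: int, p: float) -> float:
    """Normal approximation to P(Binomial(m, p) >= f0 * m)."""
    mean = m * p
    sigma = math.sqrt(m * p * (1.0 - p))
    z = (f0 * m - 0.5 - mean) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def roc_points(m: int, p_target: float, p_background: float, thresholds):
    """Return (P_FA, P_d) pairs, one per match-fraction threshold f0."""
    return [(upper_tail(f0, m, p_background), upper_tail(f0, m, p_target))
            for f0 in thresholds]


# Hypothetical example: 80 model points, per-point match probability
# .7 on target and .2 on background clutter.
thresholds = [i / 20 for i in range(1, 20)]
roc = roc_points(80, 0.7, 0.2, thresholds)
```

Extending the sketch to a search area or to several correlated models would replace `upper_tail` for the background with the area-level and multi-model expressions developed in Section 2.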

4  Adaptive  Detection  System 

Performance specifications for ATD/R systems
are typically expressed in terms of minimum Pd
and maximum FAR levels. The relationship
between these specifications and actual algorithm
performance has in the past, for the most part,
been determined empirically for each set of
training data. Test data that deviated from the
training data often has resulted in unpredictable
and uncontrollable performance using ATD/R
parameters trained or tuned for a particular
database. This is generally true for both
statistical pattern recognition and model based
approaches, although model based approaches
are usually easier to re-tune. The introduction of
predictive relationships between the algorithm
match quality statistic and the Pd and FAR
enables a mechanism whereby the adaptive
ATD/R system can calculate a range of model fit
values that are within Pd and FAR specifications
set by the user.

Figure 5. Predicted probability of detection Pd
as a function of Pfa for a set of 10 6m × 3m
models with $\rho_M$ = .15, target dimensions of 6m
× 3m, R = 3km, IFOV = 15 µrad, target
$P_{T_e}$ = .7, 5 × 5 pixel search area, background
edge density $p_c$ = .14, and a 3 × 3 Hausdorff
tolerance window.

In an initial set of experiments, we used a test
d^{'s) 

statistic  V  = - ,  where 

PM) 

M 

=  /d(*m)  is  the 


m=l 


value  of  a  clipped  inverted  distance  transform 
[Doria  and  Huttenlocher  1996], 


In  the  future  we  will  also  be  running  experiments 


with  a  test  statistic  defined  as 


y  = 


PA/o) 

PM)’ 


where $f_o$ is an observed best fraction of model
matched to data. Results from initial tests on
sample FLIR target data from the UGV RSTA
program are shown in Figure 6. For these
detection level tests, a screener was developed
that combines use of both thermal contrast and
geometric information to improve initial target
detection. The number of screener AOIs was set
to a maximum number of expected targets in the
scene (in this case 6). Following the screening
function, potential target areas of interest were
passed on to the matching (detection level)
system. Both adaptive and non-adaptive
detection level match statistics were generated.
As seen in the figure, a large fraction of the false
alarms were eliminated by the adaptive system at
the expense of a small loss in probability of
detection. These results indicate the potential
value of performing this type of adaptive analysis
and processing as a component of model based
automatic target detection, recognition, and
image exploitation systems.


Figure 6. Example of adaptive and non-adaptive
processing on a set of 10 frames from the RSTA
database.

Figure 7. Results of adaptive algorithm on
sample RSTA FLIR image: (a) original image,
(b) edge image, (c) Pfa image at $f_0$ = .80 (white
indicates high probability of a false alarm in the
local area), (d) Pfa image at $f_0$ = .95, (e)
detection evidence, (f) adaptive detection statistic
at $f_0$ = .80, (g) adaptive detection statistic at $f_0$ =
.95, (h) detections at detection statistic $f_0$ = .80,
(i) detections at detection statistic $f_0$ = .95.
System had Pd = 1.0 and 1 FA/image at $f_0$ = .95,
and Pd = 1.0 and 0 FA/image at $f_0$ = .80.

5 Summary

We  have  presented  a  performance  model  for  an 
automatic  target  detection  system  that  operates 
on  model  information  contained  in  geometric 
features.  The  extraction  of  edge  based  geometric 
information  is  related  to  target  thermal  contrast 
in  the  present  analysis.  The  false  alarm  rate  is 
seen  to  be  a  function  of  the  local  search  area,  the 
clutter  density  and  spatial  correlation,  the  number 
of  target  models  used  in  the  detection-level  search 
process,  the  size  of  the  models,  the  correlation 
between  models,  the  sensor  sampling 
characteristics,  the  algorithm  match  quality 
tolerance,  and  the  maximum  fraction  of  a  model 
matched  to  the  data.  The  probability  of  detection 
is  given  as  a  function  of  the  effective  target 
temperature,  atmospheric  attenuation,  range-to- 
target,  the  sensor  optics,  sampling,  and 
sensitivity,  the  model  size,  the  match  tolerance 
window  size,  and  the  expected  fraction  of  the 
model features matched to data features. We will
continue to evaluate the generalizing
assumptions used in the present development
with respect to their extensibility and accuracy
for performance prediction on a wide variety of
scenarios. By combining the $P_{FA}$ and Pd models,
expected receiver operating characteristic (ROC)
curves are calculable using this approach.
Because the model contains terms arising from the
important factors controlling both the Pd and FAR,
it allows trade-offs among the parameters describing
each term, and contributes to a general understanding
of the relationship between these parameters in terms of
the basic ATR performance metrics of
probability of detection and false alarm rate for
this type of algorithm. Trades between Pd and
FAR are also now possible for different sensors
and scenarios.

This  type  of  algorithm  performance  model  can 
also  be  extended  to  predict  recognition  and 
identification  capability,  as  discussed  in  [Doria 
1997].  It  is  likely  that  the  performance  of  other 
types  of  automatic  target  detection  and 
recognition  systems  that  rely  on  combined  shape 
and  signature  information,  and  that  use  local 
features,  is  also  related  in  a  general  way  to  the 
development presented here.

Abstract

We analyze the use of linear models for infrared
(IR) spectral reflectance functions. These models have
been studied extensively for visible wavelengths and
form the basis of several approaches to estimating
surface properties from color images. The infrared
analysis is performed using a set of measured spectral
reflectance data over the mid-wave and long-wave
regions of the IR spectrum. The results show that
low-dimension linear models provide an accurate
approximation to a large collection of natural and manmade
materials.


1  Introduction 

The availability of multispectral and hyperspectral
infrared imagers provides the opportunity to
develop image understanding systems with capabilities
far exceeding those of systems using visible
wavelength sensors. Several multiband infrared sensors
have been demonstrated spanning various wavelength
ranges. These include HYDICE for the near-IR (210
bands over 0.4 - 2.5 µm), ARES for the mid-wave IR
(75 bands over 2.0 - 6.3 µm), and AES for the long-wave
IR (26,000 bands over 2.3 - 15.4 µm). Data from these
sensors can be exploited by systems to achieve tasks
such as object recognition and terrain classification in
various environments.

The large number of bands captured by hyperspectral
infrared sensors presents a challenge to modules
which are required to represent and process this data.
For visible wavelengths, low-dimensional linear models
for spectral reflectance have been used to reduce the
dimensionality of spectral representations. The use
of these models is justified by several studies [1] [5]
[7] which show that visible spectral reflectance
functions can be approximated accurately using a linear
combination of a few fixed basis functions. These
models exploit the structure inherent in spectral
reflectance functions to improve representational
efficiency for many applications [6] [11].


In addition to providing representational efficiency,
linear models also play an important role in several
color constancy algorithms which are designed to
recover illumination-invariant surface information from
multispectral images [3]. In general, color constancy
is an underconstrained problem for which the
structure introduced by linear reflectance models permits a
solution. The use of linear models also allows explicit
characterization of the set of surfaces for which an
algorithm will recover illumination-invariant surface
descriptors. Several methods derived from linear models
have been used for recognition in complex scenes with
uncontrolled illumination [2] [4] [10].


In this paper, we examine the use of low-dimensional
linear models for infrared reflectance
functions. The study uses measured reflectance data
for 88 natural and manmade materials. For the
analysis, the infrared wavelengths are partitioned into the
mid-wave spectral window of the atmosphere (MWIR)
from 3 µm - 5 µm and the long-wave spectral window
of the atmosphere (LWIR) from 8 µm - 12.5 µm. The
results indicate that infrared spectral reflectance
functions and spectral emissivity functions can be
represented accurately using linear models with a small
number of parameters.

2 Linear Reflectance Models

Consider a set of M materials with spectral
reflectance functions $s_1(\lambda), s_2(\lambda), \ldots, s_M(\lambda)$. An
n-dimensional linear reflectance model for these
materials is defined by a set of n basis functions
$S_1(\lambda), S_2(\lambda), \ldots, S_n(\lambda)$ where each reflectance
function is approximated by

$$s_i(\lambda) \approx \sum_{1 \le j \le n} \sigma_{ij} S_j(\lambda) \qquad (1)$$

Thus, the linear reflectance model allows each
reflectance function $s_i(\lambda)$ to be represented using the
n weights $\sigma_{i1}, \sigma_{i2}, \ldots, \sigma_{in}$. For systems starting from
measured data, the spectral reflectance functions $s_i(\lambda)$
and the basis functions $S_j(\lambda)$ are represented at a set
of W discrete wavelengths $\lambda_1, \lambda_2, \ldots, \lambda_W$. The error
in the approximation for a single reflectance $s_i(\lambda)$ is
given by

$$E_i = \sum_{1 \le k \le W} \Big( s_i(\lambda_k) - \sum_{1 \le j \le n} \sigma_{ij} S_j(\lambda_k) \Big)^2 \qquad (2)$$

The basis functions $S_j(\lambda)$ are typically selected to
minimize the total square error

$$E_T = \sum_{1 \le i \le M} E_i \qquad (3)$$

using a procedure based on the singular value
decomposition. The important question is how large n must
be to represent accurately the set of reflectance
functions.
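
The fitting procedure can be sketched with a truncated singular value decomposition. The block below is a minimal sketch using NumPy, assuming the sampled reflectances are stacked as the rows of an M × W matrix and that the first n right singular vectors are taken as the basis functions. The synthetic spectra, the function name, and the array names are illustrative and are not the measured data used in the paper.

```python
import numpy as np


def fit_linear_model(reflectances, n: int):
    """Fit an n-dimensional linear model to sampled spectra.

    reflectances: (M, W) array, one sampled reflectance function per row.
    Returns the basis (n, W), the weights (M, n), and the per-material
    squared errors E_i (M,).
    """
    s = np.asarray(reflectances, dtype=float)
    # Rows of vt are orthonormal functions of wavelength, ordered by singular value.
    _, _, vt = np.linalg.svd(s, full_matrices=False)
    basis = vt[:n]
    weights = s @ basis.T                      # least-squares weights for an orthonormal basis
    errors = np.sum((s - weights @ basis) ** 2, axis=1)
    return basis, weights, errors


# Synthetic example: 20 "materials" built from three smooth functions,
# sampled at 101 wavelengths between 3 and 5 micrometers.
rng = np.random.default_rng(0)
wavelengths = np.linspace(3.0, 5.0, 101)
true_basis = np.stack([np.ones_like(wavelengths),
                       np.sin(wavelengths),
                       np.cos(2.0 * wavelengths)])
spectra = rng.uniform(0.1, 1.0, size=(20, 3)) @ true_basis

basis3, weights3, errors3 = fit_linear_model(spectra, n=3)
basis2, weights2, errors2 = fit_linear_model(spectra, n=2)
total_error3 = errors3.sum()                   # E_T for n = 3
total_error2 = errors2.sum()                   # E_T for n = 2
```

Because the synthetic spectra are exactly three-dimensional, the total error is essentially zero for n = 3 and strictly larger for n = 2; with measured data the error instead decays gradually as n increases, which is what the analysis in the following sections examines.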


3  IR  Reflectance  Data  Analysis 

Measured infrared spectral reflectance data for 45
manmade and 43 natural materials was used for the
analysis. The data for natural materials was obtained
from the Remote Sensing Laboratory at Johns Hopkins
University and includes measurements for various
minerals, vegetation, rocks, and soils. The
measurement techniques are described in [8] and [9]. The
data for manmade materials was obtained from the
National Photographic Interpretation Center and
includes measurements for concrete, road asphalts and
tar, construction materials, paints, and roofing
materials. From Kirchhoff’s law, the spectral emissivity
$\varepsilon(\lambda)$ for opaque materials is related to the spectral
reflectance $s(\lambda)$ by $\varepsilon(\lambda) = 1 - s(\lambda)$ so that a linear
reflectance model captures spectral emissivity as well.

The analysis was performed separately for the
mid-wave (3 µm - 5 µm) and long-wave (8 µm - 12.5 µm)
spectral windows of the atmosphere.

3.1  Analysis  for  Mid- Wave  IR 

Eighty-seven of the materials were considered over
3 µm - 5 µm using 101 wavelength samples separated
by 0.02 µm. (One of the 88 original materials did not
have measurements for these wavelengths.) Before
fitting the model, each sampled reflectance function
$s(\lambda_k)$ was scaled by a constant so that

$$\sum_k s(\lambda_k)^2 = 1 \qquad (4)$$

This normalization ensured that each reflectance
function was considered equally by the fitting process. The
SVD was then used to find the basis functions $S_j(\lambda)$
which minimize $E_T$ in (3). The first five basis
functions are plotted in figure 1 and the average error
$E_T/87$ for models using between 2 and 5 basis
functions is shown in figure 2. The error bars in figure 2
indicate the standard deviation of the $E_i$ values for
each n. Figure 3 and figure 4 show the individual fits
with the largest error using four and five basis
functions. Figure 5 shows the individual fit with the
smallest error using two basis functions.

We also analyzed the sets of 44 manmade and 43
natural materials separately. Figure 6 shows the
dependence of the average error on the number of basis
functions for the sets of natural and manmade
materials. For each value of n, the collection of natural
materials has the smaller error. This suggests that the
natural materials have a higher degree of correlation
with wavelength than the manmade materials.

3.2 Analysis for Long-Wave IR

All eighty-eight materials were considered over 8 μm to
12.5 μm using 46 wavelength samples separated by
0.1 μm. As for the mid-wave IR, each reflectance function
was normalized to have unit power as in (4) and
the SVD was used to find the basis functions which
minimized $E_T$. The first five basis functions are plotted
in figure 7 and the average error $E_T/88$ for models
using between 2 and 5 basis functions is shown in figure 8.
The error bars in figure 8 indicate the standard
deviation of the $E_i$ values for each n. Figures 9 and
10 show the individual fits with the largest error using
four and five basis functions. Figure 11 shows the
individual fit with the smallest error using two basis
functions. Figure 12 shows that for the long-wave IR,
the linear model error is also larger for manmade materials
than for natural materials.


References 

[1] J. Cohen. Dependency of the spectral reflectance
curves of the Munsell color chips. Psychonomic
Sci., 1:369, 1964.

[2] G. Healey and A. Jain. Retrieving multispectral
satellite images using physics-based invariant
representations. IEEE Trans. Pattern Anal. Machine
Intell., 18(8):842-848, August 1996.

[3] G. Healey and Q.-T. Luong. Color in computer
vision: recent progress. In C.H. Chen, L.F. Pau,
and P.S.P. Wang, editors, Handbook of Pattern
Recognition and Computer Vision. World Scientific,
1997.

[4] G. Healey and L. Wang. Illumination-invariant
recognition of texture in color images. J. Opt. Soc.
Am. A, 12(9):1877-1883, September 1995.

[5] L. Maloney. Evaluation of linear models of surface
spectral reflectance with small numbers of parameters.
J. Opt. Soc. Am. A, 3(10):1673-1683, October 1986.

[6] D. Marimont and B. Wandell. Linear models of
surface and illuminant spectra. J. Opt. Soc. Am.
A, 9(11):1905-1913, November 1992.

[7] J.P.S. Parkkinen, J. Hallikainen, and T. Jaaskelainen.
Characteristic spectra of Munsell colors.
Journal of the Optical Society of America
A, 6:318-322, 1989.

[8] J. Salisbury and D. D'Aria. Emissivity of terrestrial
materials in the 3-5 μm atmospheric window.
Remote Sensing of Environment, 47:345-361, 1994.

[9] J. Salisbury, A. Wald, and D. D'Aria. Thermal-infrared
remote sensing and Kirchhoff's law, 1.
Laboratory measurements. Journal of Geophysical
Research, 99:11,897-11,911, 1994.

[10] D. Slater and G. Healey. The illumination-invariant
recognition of 3D objects using local
color invariants. IEEE Trans. Pattern Anal. Machine
Intell., 18(2):206-210, February 1996.

[11] B. Wandell. The synthesis and analysis of color
images. IEEE Trans. Pattern Anal. Machine Intell.,
PAMI-9(1), January 1987.


Figure 1: Basis functions for 3 μm to 5 μm (horizontal axis: wavelength in μm)

Figure 2: Average error for 3 μm to 5 μm

Figure 3: Worst fit for n = 4 for 3 μm to 5 μm

Figure 6: Error for natural and manmade materials

Figure 12: Error for natural and manmade materials
Abstract

We  present  an  energy  matrix  representation  for 
multiband  images  which  captures  spatial  and  spec¬ 
tral  properties.  Using  a  physical  model  for  spectral 
reflectance,  we  derive  a  pseudoinverse  method  for  the 
comparison  of  energy  matrices  which  is  invariant  to 
the  spectral  properties  of  the  scene  illumination.  At 
the  same  time,  this  method  determines  the  illumina¬ 
tion  change  matrix  allowing  direct  comparison  of  im¬ 
ages  obtained  under  different  illumination  conditions. 
We  demonstrate  the  performance  of  the  method  for 
both  illumination-invariant  recognition  and  illumina¬ 
tion  correction  on  a  large  set  of  multiband  images. 
The  energy  matrices  are  generated  using  a  small  set 
of oriented steerable filters. We also demonstrate that
a related set of rotationally symmetric filters can be
used  for  recognition  invariant  to  both  illumination  and 
rotation  and  that  subsequent  processing  can  be  used 
to  recover  the  rotation  angle  of  a  recognized  object. 

1  Introduction 

Features  extracted  from  the  output  of  a  set  of  spa¬ 
tial  filters  are  often  used  for  image  representation  [2] 
[9]  [11].  Depending  on  the  characteristics  of  the  filters, 
these  features  can  encode  a  large  amount  of  informa¬ 
tion  about  orientation,  scale,  and  spatial  structure  in 
an  image.  The  application  of  spatial  filters  to  multi¬ 
band  images  enables  the  construction  of  feature  sets 
which  capture  a  wide  range  of  spectral  and  spatial 
properties.  These  properties  can  be  selected  to  opti¬ 
mize  performance  criteria  for  specific  applications. 

Despite  these  advantages,  filter-based  features  are 
often  not  directly  useful  for  indexing  during  recog¬ 
nition.  Changes  in  the  spectral  properties  of  the 
scene  illumination,  for  example,  can  lead  to  signifi¬ 
cant  changes  in  filter-based  features  for  a  fixed  object. 
This  dependence  on  illumination  conditions  limits  the 
usefulness  of  filter-based  features  for  recognition  to  en¬ 
vironments  where  the  illumination  can  be  controlled. 

The success of Color Indexing [14] for recognizing
objects from a large database using only multispectral
distribution information sparked interest in the
development of methods which generalize the basic approach
for applicability to environments with changing
illumination. Using physical models for multiband
image formation, Funt and Finlayson [5] and Healey
and Slater [7] developed indexing methods which are
relatively insensitive to scene illumination conditions.
For the most part, however, these methods disregard
image spatial structure and by themselves have been
shown to be ineffective for recognition in moderately
sized databases [6] [7].

Previous  work  on  illumination-invariant  recognition 
using  spatial  structure  makes  use  of  a  multiband  cor¬ 
relation  model  [8].  This  model  has  been  exploited  suc¬ 
cessfully  for  recognition  in  the  presence  of  large  illu¬ 
mination  changes.  The  multiband  correlation  model, 
however,  represents  an  image  region  using  six  correla¬ 
tion  functions  which  typically  contain  a  large  amount 
of  redundant  information.  A  computationally  expen¬ 
sive  projection  method  is  used  for  the  illumination- 
invariant  comparison  of  these  functions.  The  complex¬ 
ity  of  the  correlation  model  transformation  in  response 
to  illumination  changes  prohibits  the  use  of  this  model 
for  the  direct  recovery  of  the  illumination  change  ma¬ 
trix. 

In  this  paper,  we  present  a  compact  multiband 
image  representation  based  on  filtered  energy  fea¬ 
tures.  We  show  that  matrices  of  these  features  trans¬ 
form  in  the  same  way  as  the  multiband  sensor  vec¬ 
tors  in  response  to  an  illumination  change.  From 
this  relationship,  we  derive  an  efficient  pseudoinverse 
method  which  simultaneously  provides  a  metric  for 
the  illumination-invariant  comparison  of  energy  ma¬ 
trices  and  computes  the  illumination  change  matrix. 
We  demonstrate  the  use  of  this  method  for  two  classes 
of  recognition  problems  by  considering  sets  of  both 
oriented  and  rotationally  symmetric  filters  to  produce 
the  energy  matrices. 


2  Modeling  Filtered  Color  Images 

Consider a color imaging system that records N
measurements at each location (x, y) given by

$$I_i(x,y) = \int l(\lambda)\, s(x,y,\lambda)\, f_i(\lambda)\, d\lambda \qquad 1 \le i \le N \qquad (1)$$

where $s(x,y,\lambda)$ is the spectral reflectance of the
surface, $f_i(\lambda)$ is the response of the ith photoreceptor
class, and $\lambda$ denotes wavelength. Denote
the multiband image by the vector
$I(x,y) = (I_1(x,y), I_2(x,y), \ldots, I_N(x,y))^T$. Let $\tilde{I}(x,y)$ be the
image of a surface illuminated by spectral distribution
$\tilde{l}(\lambda)$ and let $I(x,y)$ be the image of the same surface
illuminated by spectral distribution $l(\lambda)$. Using the
linear spectral reflectance model


$$s(x,y,\lambda) = \sum_{1 \le j \le N} \sigma_j(x,y)\, S_j(\lambda) \qquad (2)$$

where the $S_j(\lambda)$ are fixed basis functions, the images
are related by a linear transformation [8]

$$I(x,y) = M \tilde{I}(x,y) \qquad (3)$$

where M is an N × N matrix with elements that depend
on $l(\lambda)$ and $\tilde{l}(\lambda)$. The accuracy of the linear
model in (2) has been confirmed by several studies [3].

A linear filter h(x, y) can be applied to each band of
I(x, y) to obtain a filtered multiband image
$O(x,y) = (O_1(x,y), O_2(x,y), \ldots, O_N(x,y))^T$ where

$$O_i(x,y) = I_i(x,y) * h(x,y) \qquad (4)$$

and * denotes two-dimensional convolution. From
(3), the filtered images $O(x,y)$ and $\tilde{O}(x,y)$ derived
from $I(x,y)$ and $\tilde{I}(x,y)$ are related by the linear
transformation M according to

$$O(x,y) = M \tilde{O}(x,y). \qquad (5)$$

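Let $h_1(x,y), h_2(x,y), \ldots, h_n(x,y)$ be a set of
real positive linear filters, $h_i(x,y) \ge 0$ for
$(x,y) \in (-\infty, \infty)$, $i = 1, 2, \ldots, n$. Define the
energy in a filtered image band by

$$E_{ij} = \sum_{x,y} I_i(x,y) * h_j(x,y) \qquad (6)$$

The energy matrix of the N band image I(x, y) for the
set of n filters is given by

$$E = \begin{pmatrix} E_{11} & E_{12} & \cdots & E_{1n} \\ E_{21} & E_{22} & \cdots & E_{2n} \\ \vdots & \vdots & & \vdots \\ E_{N1} & E_{N2} & \cdots & E_{Nn} \end{pmatrix} \qquad (7)$$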

If this set of filters is applied to the multiband images
$I(x,y)$ and $\tilde{I}(x,y)$, then from (5) and (6) the corresponding
energy matrices are related by

$$E = M\tilde{E} \qquad (8)$$


where  M  is  the  linear  transformation  in  (3).  This  re¬ 
lationship  will  be  used  to  derive  methods  for  illumina¬ 
tion  correction  and  illumination-invariant  recognition. 
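The relationship between energy matrices follows from the linearity of convolution and summation, and it can be checked numerically. The following is a minimal sketch under stated assumptions: the images and filters are random synthetic arrays, the convolution is a full convolution with zero padding, and the names `conv2d_full` and `energy_matrix` and the matrix `M` are hypothetical choices made for illustration.

```python
import numpy as np

def conv2d_full(image, kernel):
    """Full two-dimensional convolution by direct summation (zero padding)."""
    rows, cols = image.shape
    k_rows, k_cols = kernel.shape
    out = np.zeros((rows + k_rows - 1, cols + k_cols - 1))
    for a in range(k_rows):
        for b in range(k_cols):
            out[a:a + rows, b:b + cols] += kernel[a, b] * image
    return out

def energy_matrix(bands, filters):
    """E[i, j] = sum over (x, y) of band i convolved with filter j, as in (6)."""
    return np.array([[conv2d_full(band, h).sum() for h in filters]
                     for band in bands])

rng = np.random.default_rng(0)
I_tilde = rng.random((3, 16, 16))              # N = 3 bands under one illuminant
M = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.7, 0.1],
              [0.0, 0.2, 0.5]])                # hypothetical illumination change
I = np.einsum("ij,jxy->ixy", M, I_tilde)       # relation (3) at every pixel
filters = rng.random((4, 5, 5))                # n = 4 non-negative filters

E = energy_matrix(I, filters)
E_tilde = energy_matrix(I_tilde, filters)
print(np.allclose(E, M @ E_tilde))             # checks relation (8)
```

Because the filters and image bands are non-negative, the summed filter output needs no squaring or absolute value, which is what keeps the energy linear in the image.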

3  Illumination  Correction  and  Recog¬ 
nition 

3.1  Estimating  Illumination  Changes 

Given energy matrices $E$ and $\tilde{E}$ corresponding to
the same surface under different illuminants, the illumination
change matrix M can be estimated. If
n > N, then (8) is overdetermined and M can be
estimated using the pseudoinverse

$$M_p = E\tilde{E}^T (\tilde{E}\tilde{E}^T)^{-1} \qquad (9)$$

If we define the residual matrix

$$R = E - M\tilde{E} \qquad (10)$$

then $M_p$ is the matrix M which minimizes the sum of
the squares of the elements of R. Since M is the matrix
which relates the original images in (3), the matrix $M_p$
can be used to transform the image $\tilde{I}(x,y)$ obtained
under illumination $\tilde{l}(\lambda)$ to the corresponding image
$M_p\tilde{I}(x,y)$ which would be obtained under illumination
$l(\lambda)$.

Given two images, an important question for recognition
is whether measured energy matrices $E_1$ and $E_2$
correspond to the same surface under different illumination
conditions. In this context, the residual matrix
formed using the pseudoinverse

$$R = E_1 - [E_1 E_2^T (E_2 E_2^T)^{-1}]\, E_2 \qquad (11)$$

is a measure of how well the set of energy vectors for
image 1 (columns of $E_1$) can be related to the set of
energy vectors for image 2 (columns of $E_2$) by a single
linear transformation. In section 4, the squared norm
of R

$$\|R\|^2 = \sum_{1 \le i \le N} \sum_{1 \le j \le n} R_{ij}^2 \qquad (12)$$

will be used for the illumination-invariant comparison
of energy matrices. Small values of $\|R\|^2$ indicate that
a linear relationship of the form of (8) holds for $E_1$
and $E_2$, which is a necessary condition for these energy
matrices to correspond to the same surface.
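The pseudoinverse estimate and the residual norm can be illustrated with a short NumPy sketch. The energy matrices below are random synthetic data and the function names `illumination_change` and `residual_norm` are hypothetical; the code is meant only to show the shape of the computation in (9), (11), and (12), not to reproduce the experiments.

```python
import numpy as np

def illumination_change(E, E_tilde):
    """Least-squares estimate M_p of M in E = M E_tilde, as in (9)."""
    return E @ E_tilde.T @ np.linalg.inv(E_tilde @ E_tilde.T)

def residual_norm(E1, E2):
    """Squared norm of R = E1 - M_p E2, as in (11) and (12)."""
    R = E1 - illumination_change(E1, E2) @ E2
    return float(np.sum(R ** 2))

rng = np.random.default_rng(1)
E2 = rng.random((3, 7))                 # N = 3 bands, n = 7 filters
M = rng.random((3, 3)) + np.eye(3)      # hypothetical illumination change
E1_same = M @ E2                        # same surface under a new illuminant
E1_other = rng.random((3, 7))           # unrelated surface

print(residual_norm(E1_same, E2))       # essentially zero
print(residual_norm(E1_other, E2))      # clearly larger
```

For poorly conditioned matrices, a least-squares solver such as `np.linalg.lstsq` would be a more robust choice than forming the explicit inverse; the explicit form is used here only because it mirrors (9).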

3.2  Filter  Selection 

The set of filters $h_1(x,y), h_2(x,y), \ldots, h_n(x,y)$ can
be selected to extract a wide range of spatial characteristics
of an input image. For many applications, the
outputs of oriented filters are useful for representing
image spatial structure. In this paper, we will examine
the use of steerable filters [1] [4] [13] which capture
information about the response of a filter at any orientation
using a small set of basis filters. Many functions
are steerable including all polynomials in x and y multiplied
by a rotationally symmetric function [4]. Since
the relationship in (8) requires a set of filters with
non-negative values, we choose the kernel
$x^2 e^{-(x^2+y^2)/2\sigma^2}$. For this kernel, a filter at an
arbitrary orientation $\theta$ can be synthesized using a linear
combination of three basis filters [4] according to

$$h^{\theta}(x,y) = k_1(\theta)\, h^{0^\circ}(x,y) + k_2(\theta)\, h^{60^\circ}(x,y) + k_3(\theta)\, h^{120^\circ}(x,y) \qquad (13)$$

where

$$h^{0^\circ}(x,y) = x^2 e^{-(x^2+y^2)/2\sigma^2}$$
$$h^{60^\circ}(x,y) = \left(\tfrac{1}{2}x + \tfrac{\sqrt{3}}{2}y\right)^2 e^{-(x^2+y^2)/2\sigma^2} \qquad (14)$$
$$h^{120^\circ}(x,y) = \left(-\tfrac{1}{2}x + \tfrac{\sqrt{3}}{2}y\right)^2 e^{-(x^2+y^2)/2\sigma^2}$$

and

$$k_1(\theta) = 1 + 2\cos 2\theta$$
$$k_2(\theta) = 1 - \cos 2\theta + \sqrt{3}\sin 2\theta \qquad (15)$$
$$k_3(\theta) = 1 - \cos 2\theta - \sqrt{3}\sin 2\theta$$

Figure 1: Spatial Response of Basis Filters

Figure 2: Frequency Response of Basis Filters

Thus, the output of these three basis filters determines
the output for a filter of arbitrary orientation. Figure
1 shows the basis filters $h^{0^\circ}(x,y)$, $h^{60^\circ}(x,y)$, $h^{120^\circ}(x,y)$.
Figure 2 shows the frequency response of these filters.

The use of oriented filters for recognition requires
that the orientation of an object is known. In applications
where object orientation is unknown, rotation
invariant recognition can be achieved by using filters
which are rotationally symmetric. A rotation invariant
version of the oriented filter defined by (13) can
be constructed by integrating over $\theta$ to obtain

$$H(x,y) = \int_\theta h^{\theta}(x,y)\, d\theta \qquad (16)$$

Figure 3 shows the filter H(x,y) and figure 4 shows
its frequency response.

The set of rotationally invariant filters
$h_1(x,y), h_2(x,y), \ldots, h_n(x,y)$ which are used for
recognition can be constructed using (16) for several
values of the scale parameter $\sigma$. Following recognition
and illumination correction, the rotation angle of an
object in the scene can be estimated using oriented
filters. The rotation angle estimation process is described
in the Appendix.

4 Experimental Results

We conducted two sets of experiments to test the
method developed in section 3 for recognition and illumination
correction. In the set of experiments described
in 4.1, the oriented filters were used for recognition
in the presence of illumination changes. In the
set of experiments described in 4.2, the rotationally
symmetric filters were used for recognition in the presence
of combined illumination and rotation changes.
In each of these experiments, the squared norm $\|R\|^2$
defined by (11) and (12) was used for the comparison
of energy matrices. To evaluate the effectiveness
of the new approach over traditional approaches, we
also examined the use of the direct distance between
energy matrices $E_1$ and $E_2$ defined by

$$\|E_1 - E_2\|^2 = \sum_{1 \le i \le N} \sum_{1 \le j \le n} \left((E_1)_{ij} - (E_2)_{ij}\right)^2 \qquad (17)$$

for recognition.

The  multiband  images  for  these  experiments  were 
obtained  at  our  image  acquisition  facility  using  a  Sony 
XC-77 CCD camera with three Corion filters having
the spectral transmission bands: CA-600 (580-700nm),
CA-550 (500-600nm), CA-500 (400-520nm)
in conjunction with a RasterOps TC-PIP framegrabber.
A Newport model 765 tungsten-halogen light
source  filtered  by  a  Corion  BG-38  blue-green  filter  was 
used  to  obtain  nearly  white  illumination.  Yellow,  red, 
and  green  illuminants  were  obtained  by  using  color 
transmission  filters  with  the  Newport  source. 

4.1  Illumination  Invariant  Recognition 

A database was formed using images of sixteen objects
obtained under white illumination. Each object
was represented by the 3 × 7 energy matrix defined by
(7). Six spatial filters defined by $h^{0^\circ}(x,y)$, $h^{60^\circ}(x,y)$,
and $h^{120^\circ}(x,y)$ of (14) with $\sigma = 2$ and $\sigma = 4$ were
used. The filtered energies were augmented by the energies
in each of the three unfiltered color bands so
that n = 7.

Figure 3: Spatial Response of H(x,y)

Figure 4: Frequency Response of H(x,y)

A set of forty-eight test images was assembled by
imaging each database object under each of yellow,
red, and green illumination. Each test image was represented
by the 3 × 7 energy matrix defined above
and classified as an instance of the database object
with which it has the smallest distance $\|R\|^2$ defined
by (11) and (12). Using this method, each of the 48
test images was classified correctly. For each match,
the illumination correction matrix $M_p$ was computed
using (9). According to (3), this enables a test image
$I_T(x,y)$ to be transformed to its appearance under
the database white illumination using $M_p I_T(x,y)$.
For comparison, each of the test images was classified
using the direct distance $\|E_1 - E_2\|^2$ defined by
(17). Using this distance, only 5 of the 48 test images
were classified correctly. Table 1 shows the distribution
of classification ranks for the 48 test images using
$\|E_1 - E_2\|^2$. For example, a rank of 6 for a test image
means that the correct match in the database received
the sixth smallest value of $\|E_1 - E_2\|^2$.
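The classification rule and the notion of rank used here can be sketched as follows. The database is random synthetic data rather than measured energy matrices, and the names `residual_distance`, `direct_distance`, and `rank_of_correct` are hypothetical. Taking the database matrix as $E_1$ and the test matrix as $E_2$ in (11) is an assumption made so that $M_p$ maps the test image toward the database illumination.

```python
import numpy as np

def residual_distance(E_db, E_test):
    """Squared norm of E_db - M_p E_test, with M_p from the pseudoinverse."""
    M_p = E_db @ E_test.T @ np.linalg.inv(E_test @ E_test.T)
    return float(np.sum((E_db - M_p @ E_test) ** 2))

def direct_distance(E_db, E_test):
    """Squared element-wise distance between two energy matrices, as in (17)."""
    return float(np.sum((E_db - E_test) ** 2))

def rank_of_correct(E_test, database, correct, distance):
    """1-based rank of the correct database entry under the given distance."""
    d = np.array([distance(E_db, E_test) for E_db in database])
    return int(np.where(np.argsort(d) == correct)[0][0]) + 1

rng = np.random.default_rng(2)
database = rng.random((16, 3, 7))        # 16 objects, 3 x 7 energy matrices
# Hypothetical illumination change applied to every test matrix.
M = np.diag([1.4, 0.6, 0.2]) + 0.05 * rng.random((3, 3))

for k in (0, 5, 15):
    E_test = M @ database[k]
    print(k,
          rank_of_correct(E_test, database, k, residual_distance),
          rank_of_correct(E_test, database, k, direct_distance))
```

A rank of 1 means the test matrix was assigned to the correct object; larger ranks count how many database entries were closer than the correct one.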

Figures 5-10 display several database and test images
along with the corresponding illumination corrected
images. In each row, the left image is the
database image and the right image is the test image.
The middle image is the illumination corrected
test image. We see that in most cases the illumination
corrected test image is visually indistinguishable from
the database image.

Table 1: Illumination Invariant Recognition (Direct
Distance)

The plot in figure 11 shows the set of distances
$\|R\|^2$ computed between a test image of object 2 (peacock)
under green illumination and each database object.
Large distances for a database object are truncated
at the top of the scale. We observe that the distance
for the correct match (object 2) is much smaller
than the other distances. Such a distribution of distances
is typical for this set of experiments.

4.2  Illumination  and  Rotation  Invariant 
Recognition 

The method developed in section 3 can be used
for the recognition of objects following illumination
changes and plane rotations if a set of rotationally
symmetric filters is used. We conducted a set of experiments
to study the performance of the algorithm
for this case using a database of sixteen objects imaged
under white illumination. For this set of experiments,
3 × 4 energy matrices were used. The matrices
were computed using the three rotationally symmetric
filters H(x,y) defined by (16) for $\sigma = 2$, $\sigma = 4$,
and $\sigma = 8$ in conjunction with the energies in each of
the three unfiltered color bands. A set of 64 test images
was generated by imaging rotated versions of the
database objects under white, yellow, red, and green
illumination. As in 4.1, each of the 64 test images
was classified as an instance of the database object
with which it has the smallest distance $\|R\|^2$ defined
by (11) and (12). Using this approach, each of the 64
test images was classified correctly. For each match of
test image to database image, the illumination correction
matrix $M_p$ was computed by (9) and the rotation
angle was computed by the method described in the
Appendix using $h^{0^\circ}(x,y)$, $h^{60^\circ}(x,y)$, and $h^{120^\circ}(x,y)$ with
$\sigma = 8$. Using the direct distance $\|E_1 - E_2\|^2$ resulted
in 22 of the 64 test images receiving the correct classification.
Table 2 shows the distribution of classification
ranks for the 64 test images using $\|E_1 - E_2\|^2$.
We note that sixteen of the test images are rotated
versions of database images under the white database
illumination. As might be expected given the use of
rotationally symmetric filters, all sixteen of these test
images were among the 22 which received the correct
classification.

Figures 12-14 display several database images and
test images along with the illumination and rotation
corrected test images. In each row, the test image is
shown on the right and the matching database image is
shown on the left. The middle image in each row is the
illumination and rotation corrected test image. For
each test image, the rotation correction was accurate
to within 3°.

5  Summary 

We  have  introduced  an  energy  matrix  representa¬ 
tion  for  multiband  images  defined  using  a  small  set 
of spatial filters. Using an accurate linear model for
spectral reflectance, we have shown that energy matrices
deform according to a linear transformation in
response  to  a  change  in  the  spectral  properties  of  the 
illumination.  This  relationship  is  the  basis  of  a  pseu¬ 
doinverse  method  for  the  illumination-invariant  com¬ 
parison  of  energy  matrices  which  also  recovers  the 
illumination  change  relating  the  original  multiband 
images.  By  using  rotationally  symmetric  filters,  the 
method  can  be  used  for  the  recognition  of  rotated 
objects  in  the  presence  of  illumination  changes.  A 
large  set  of  experiments  demonstrates  the  effectiveness 
of  the  new  algorithm  compared  to  traditional  energy 
matching  approaches. 

Appendix:  Rotation  Angle  Estimation 


Consider a pair of color images $I(x,y)$ and $\tilde{I}(x,y)$
which are related by an illumination change and a rotation.
Using a set of rotationally symmetric filters,
the illumination change $M_p$ can be estimated using (9)
and the illumination corrected image $M_p\tilde{I}(x,y)$ will be
related to $I(x,y)$ by a rotation.

Using the filter $h^{\theta}(x,y)$ defined by (13), let the orientation
$\theta_i$ of an image band $I_i(x,y)$ be the value of $\theta$
for which the energy

$$E_i(\theta) = \sum_{x,y} I_i(x,y) * h^{\theta}(x,y) \qquad (18)$$

is maximum. From (13), the oriented energy is given
by

$$E_i(\theta) = k_1(\theta) E_i(0^\circ) + k_2(\theta) E_i(60^\circ) + k_3(\theta) E_i(120^\circ) \qquad (19)$$

which is maximized for

$$\theta_i = \frac{1}{2}\arctan\left(\frac{\sqrt{3}\,\left(E_i(60^\circ) - E_i(120^\circ)\right)}{2E_i(0^\circ) - E_i(60^\circ) - E_i(120^\circ)}\right) \qquad (20)$$

Thus, the orientation is determined using the output
of only three filters.
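The orientation estimate can be written in a few lines. In this sketch the three energies are arbitrary illustrative numbers, the function names are hypothetical, and the two-argument arctangent `math.atan2` is used in place of a plain arctangent so that the returned angle is the maximum of the oriented energy rather than its minimum; that branch selection is an assumption of this example, not something stated in the text.

```python
import math

def steering_coefficients(theta):
    """Coefficients k1, k2, k3 of (15) for an orientation theta in radians."""
    c, s = math.cos(2 * theta), math.sin(2 * theta)
    return 1 + 2 * c, 1 - c + math.sqrt(3) * s, 1 - c - math.sqrt(3) * s

def oriented_energy(theta, e0, e60, e120):
    """Energy at orientation theta from the three basis energies, as in (19)."""
    k1, k2, k3 = steering_coefficients(theta)
    return k1 * e0 + k2 * e60 + k3 * e120

def dominant_orientation(e0, e60, e120):
    """Orientation in [0, pi) that maximizes the oriented energy, as in (20)."""
    num = math.sqrt(3) * (e60 - e120)
    den = 2 * e0 - e60 - e120
    # atan2 keeps the quadrant of 2*theta, which selects the maximum.
    return (0.5 * math.atan2(num, den)) % math.pi

# Illustrative basis energies for one image band (not measured values).
e0, e60, e120 = 5.0, 2.0, 1.0
theta = dominant_orientation(e0, e60, e120)
print(math.degrees(theta), oriented_energy(theta, e0, e60, e120))
```

Only the three basis energies are needed, so no filtering at intermediate orientations is required to locate the maximum.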

For N bands the orientation vector is given by
$\theta = (\theta_1, \theta_2, \ldots, \theta_N)$. Given orientation vectors $\theta$ and
$\tilde{\theta}$ corresponding to images $I(x,y)$ and $M_p\tilde{I}(x,y)$, the
rotation angle $\alpha$ is estimated as the average relative
rotation over the N bands using


(21) 
Figure 5: Peacock

Figure 6: Flowers

Figure 7: Trees

Figure 8: Painted Pattern

Figure 9: Box

Figure 10: Shell

Figure 11: Database distances for object 2 under green
illumination

Table 2: Illumination/Rotation Invariant Recognition
(Direct Distance)

Figure 12: Trees

Figure 14: Painted Pattern
Spatial filters provide a useful and efficient means of
analyzing  an  input  color  image  into  components  which 
capture  different  spatial  properties.  Representations 
based  on  spatial  filtering  have  restricted  usefulness  for 
recognition,  however,  because  the  output  of  a  spatial 
filter  across  an  image  depends  on  the  scene  illumi¬ 
nation  conditions.  In  this  paper,  we  use  a  physically 
accurate  linear  model  for  spectral  reflectance  to  derive 
invariants  of  distributions  in  spatially  filtered  color  im¬ 
ages  that  do  not  depend  on  the  scene  illumination. 
These  invariants  can  be  used  for  the  illumination- 
invariant  recognition  of  regions  following  an  arbitrary 
linear  filtering  operation.  We  describe  a  method  for 
illumination  correction  based  on  color  distributions 
and  introduce  an  illumination  change  consistency  con¬ 
straint  which  is  useful  for  verifying  matches  obtained 
using  the  invariants.  We  show  using  a  set  of  classi¬ 
fication  experiments  that  the  filtered  distribution  in¬ 
variants  can  significantly  improve  the  capability  of  a 
recognition system in environments where illumination
cannot be controlled.

1  Introduction 

The  measured  color  distribution  in  the  image  of 
an  object  provides  a  useful  source  of  information  for 
recognition.  Swain  and  Ballard  [13]  have  shown,  for 
example,  that  color  distributions  can  often  be  used  di¬ 
rectly  for  recognition  without  using  corresponding  ge¬ 
ometric  information.  Such  a  method  is  effective  in  sit¬ 
uations  where  the  illumination  spectral  content  is  held 
constant.  As  the  illumination  environment  changes, 
however,  the  color  distribution  measured  in  the  image 
changes  limiting  the  usefulness  of  this  approach. 

Recent  work  on  color  constancy  has  focused  on 
computing  illumination-invariant  descriptors  of  color 
image  regions.  Funt  and  Finlayson  [3]  introduced 
a  method  called  Color  Constant  Color  Indexing 
that  matches  distributions  of  color  ratios  which  are 
illumination-invariant  under  the  coefficient  model  of 
sensor  response.  Healey  and  Slater  [6]  used  a  linear 
spectral  reflectance  model  to  derive  a  set  of  moment 
invariants  of  color  distributions  that  does  not  depend 
on  the  illumination.  Each  of  these  methods  has  been 
shown  to  improve  recognition  accuracy  significantly  in 
the presence of illumination changes over direct distribution
matching. The representations used by both
methods, however, fail to capture most of the spatial
information in a color image. This is perhaps an important
reason why these methods have been shown
to lack the discriminatory power necessary for recognition
in large databases [5] [6].

Image  models  which  characterize  spatial  structure 
provide  important  information  that  is  not  captured  by 
color  distributions.  Spatial  structure  is  particularly 
significant  for  textured  regions  which  often  occur  in 
images  of  natural  scenes.  The  analysis  of  gray-scale 
texture  has  received  significant  attention  over  many 
years, e.g. [1] [4], whereas considerably less effort has
been  devoted  to  the  study  of  spatial  structure  in  color 
images.  Recently,  the  effectiveness  of  a  color  correla¬ 
tion  model  for  illumination-invariant  recognition  has 
been demonstrated on a set of color textures [7]. The
disadvantage  of  this  approach  is  that  the  representa¬ 
tion  includes  six  correlation  functions  and  recognition 
is  achieved  by  the  projection  of  these  functions  onto 
the  database  correlation  representation.  For  recogni¬ 
tion  in  large  databases,  efficiency  considerations  sug¬ 
gest  the  use  of  a  more  compact  color  image  represen¬ 
tation. 

Spatial  filters  can  be  used  to  capture  spatial  prop¬ 
erties  of  interest  in  a  color  image.  Filter-based  im¬ 
age  representations  have  the  advantage  that  they  can 
be  computed  efficiently  and  component  filters  can  be 
selected  to  optimize  performance  for  specific  applica¬ 
tions.  In  the  context  of  recognition,  however,  the  out¬ 
puts  of  spatial  filters  depend  on  the  illumination  con¬ 
ditions.  In  this  paper,  we  show  that  color  distributions 
in  the  filtered  image  of  a  surface  undergo  a  linear  coor¬ 
dinate  transform  in  response  to  illumination  changes. 
This  relationship  is  used  to  derive  illumination  invari¬ 
ants  of  distributions  in  spatially  filtered  color  images. 
This  provides  a  powerful  mechanism  for  recognizing 
a  wide  range  of  spatial  patterns  in  color  images  un¬ 
der  unknown  illumination  conditions.  In  addition,  we 
show  that  combining  descriptors  over  several  filtered 
distributions  leads  to  significantly  better  recognition 
rates  than  can  be  achieved  by  using  a  single  filtered 
distribution. 
2 Representing Color Images

At each location (x, y) a color imaging system obtains
the n measurements

$$p_i(x,y) = \int l(\lambda)\, s(x,y,\lambda)\, f_i(\lambda)\, d\lambda \qquad 1 \le i \le n \qquad (1)$$

where $l(\lambda)$ is the spectral power distribution of the
scene illumination, $s(x,y,\lambda)$ is the spectral reflectance
of the surface, $f_i(\lambda)$ is the sensitivity of the ith sensor
class, and $\lambda$ denotes wavelength.

The spectral reflectance at each location (x, y) can
be approximated by

$$s(x,y,\lambda) = \sum_{1 \le j \le m} \sigma_j(x,y)\, S_j(\lambda) \qquad (2)$$

where the $S_j(\lambda)$ are a set of m fixed basis functions.
Such approximations have been used previously
in color vision and several researchers have analyzed
their accuracy for large sets of spectral reflectance
functions [2] [8] [9]. Three properly chosen basis functions
have been shown to provide accurate approximations
to a large range of spectral reflectance functions.
We consider the case where m = n. Let
$p(x,y) = (p_1(x,y), p_2(x,y), \ldots, p_n(x,y))^T$ denote the column
vector of sensor measurements and let
$\sigma(x,y) = (\sigma_1(x,y), \sigma_2(x,y), \ldots, \sigma_n(x,y))^T$ denote the column
vector of spectral reflectance function weights. Then

$$p(x,y) = A\sigma(x,y) \qquad (3)$$

where A is an n × n matrix with elements

$$A_{ij} = \int l(\lambda)\, S_j(\lambda)\, f_i(\lambda)\, d\lambda \qquad (4)$$

Thus, $\sigma(x,y)$ describes the surface spectral reflectance
across the image and A depends on the illumination
$l(\lambda)$ but not on image location.

Suppose that two color images are taken of a scene. The first image $p(x,y)$ is
obtained using illumination with spectral distribution $l(\lambda)$ and the
second image $\tilde{p}(x,y)$ is obtained using illumination with spectral
distribution $\tilde{l}(\lambda)$. The two images are described by

$$p(x,y) = A\,\sigma(x,y), \qquad \tilde{p}(x,y) = \tilde{A}\,\sigma(x,y). \tag{5}$$

If the matrices $A$ and $\tilde{A}$ corresponding to $l(\lambda)$ and
$\tilde{l}(\lambda)$ are nonsingular, then

$$\tilde{p}(x,y) = M\,p(x,y) \tag{6}$$

where $M$ is an $n \times n$ matrix.
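
The step from (5) to (6) can be checked numerically. The following is a minimal
sketch rather than code from the paper. It assumes wavelength is sampled on a
discrete grid, so that the integral in (4) becomes a sum, and it uses random
placeholder curves for the illuminants, basis functions, and sensors. Under
these assumptions (5) gives $\sigma = A^{-1}p$, so the matrix in (6) is
$M = \tilde{A}A^{-1}$. The helper name `mixing_matrix` is hypothetical.

```python
import numpy as np

def mixing_matrix(illum, basis, sensors, dlam):
    """Discretized form of (4): A[i, j] = sum_k l[k] * s_j[k] * f_i[k] * dlam.

    illum:   (K,)   illuminant spectral power on K wavelength samples
    basis:   (n, K) reflectance basis functions s_j
    sensors: (n, K) sensor sensitivities f_i
    """
    return (sensors * illum) @ basis.T * dlam

rng = np.random.default_rng(0)
K, n = 31, 3                      # hypothetical: 31 wavelength samples, 3 bands
dlam = 10.0                       # hypothetical sample spacing in nm
basis = rng.random((n, K))        # placeholder basis functions
sensors = rng.random((n, K))      # placeholder sensor curves
illum_1 = rng.random(K) + 0.5     # first illuminant (placeholder)
illum_2 = rng.random(K) + 0.5     # second illuminant (placeholder)

A = mixing_matrix(illum_1, basis, sensors, dlam)
A_tilde = mixing_matrix(illum_2, basis, sensors, dlam)

# Reflectance weights sigma(x, y) for 100 pixels, one column per pixel.
sigma = rng.random((n, 100))
p = A @ sigma                     # image under the first illuminant, eq. (5)
p_tilde = A_tilde @ sigma         # image under the second illuminant, eq. (5)

# From (5), sigma = A^{-1} p, hence M = A_tilde A^{-1} in (6).
M = A_tilde @ np.linalg.inv(A)
print(np.allclose(p_tilde, M @ p))
```

Note that $M$ is the same for every pixel, which is what makes the histogram
relationships later in this section possible.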

Suppose we apply the linear filter $g(x,y)$ to each band of $p(x,y)$ and
$\tilde{p}(x,y)$ to obtain

$$p'(x,y) = (p_1(x,y) * g(x,y), \ldots, p_n(x,y) * g(x,y)) \tag{7}$$

$$\tilde{p}'(x,y) = (\tilde{p}_1(x,y) * g(x,y), \ldots, \tilde{p}_n(x,y) * g(x,y)) \tag{8}$$

where $*$ denotes convolution. From (7) and (8), we can write

$$\tilde{p}'(x,y) = M\,p'(x,y) \tag{9}$$

Let $H'(\cdot)$ be the histogram having an n-dimensional argument describing the
color distribution in the filtered color image $p'(x,y)$. Let
$\tilde{H}'(\cdot)$ be the histogram having an n-dimensional argument describing
the color distribution in the filtered color image $\tilde{p}'(x,y)$. Then from
(9)

$$\tilde{H}'(Mp) = H'(p) \tag{10}$$

Thus, an illumination change causes distributions in the filtered images to
deform according to a linear coordinate transformation. This relationship will
be used for the derivation of illumination invariants in the next section.
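
Equation (9) holds because the filter acts on each band separately and
linearly, so it commutes with the per-pixel matrix $M$. The sketch below checks
this numerically. The image, the matrix $M$, and the mask parameters are
arbitrary choices made for illustration, and `gaussian_mask` and `filter_bands`
are hypothetical helpers, not functions from the paper.

```python
import numpy as np

def gaussian_mask(size, sigma):
    """Normalized 2-D Gaussian mask of shape (size, size)."""
    r = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(r[:, None] ** 2 + r[None, :] ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def filter_bands(image, mask):
    """Apply `mask` to every band of an (H, W, n) image, 'valid' region only.

    The mask is symmetric, so correlation and convolution coincide here.
    """
    kh, kw = mask.shape
    H, W, n = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1, n))
    for i in range(kh):
        for j in range(kw):
            out += mask[i, j] * image[i:i + out.shape[0], j:j + out.shape[1], :]
    return out

rng = np.random.default_rng(1)
p = rng.random((32, 32, 3))               # image under the first illuminant
M = rng.random((3, 3)) + np.eye(3)        # arbitrary illumination change matrix
p_tilde = p @ M.T                         # per-pixel p_tilde = M p, eq. (6)

g = gaussian_mask(9, 1.5)
p_f = filter_bands(p, g)                  # p'(x, y), eq. (7)
p_tilde_f = filter_bands(p_tilde, g)      # filtered second image, eq. (8)

# Eq. (9): the filtered images are related by the same matrix M.
print(np.allclose(p_tilde_f, p_f @ M.T))
```

The same check passes for any linear filter, including masks that are not
symmetric, as long as the same mask is applied to every band.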

3  Illumination  Invariants 
3.1  Distribution  Invariants 

We define illumination invariants as numbers computed from an image region
that do not depend on the illumination $l(\lambda)$. Let $H(\cdot)$ be the
histogram having an n-dimensional argument describing the color distribution in
a region of the image $p(x,y)$. Let $\tilde{H}(\cdot)$ be the histogram having
an n-dimensional argument corresponding to the same region of the image
$\tilde{p}(x,y)$. Then from (6), the measured color pixel distribution for the
region will deform according to a linear coordinate transformation in response
to an illumination change as given by

$$\tilde{H}(Mp) = H(p) \tag{11}$$

Using this relationship, we have developed an efficient method for computing
vectors of illumination invariants from color distributions [6]. The method is
based on a technique for computing affine algebraic moment invariants of
functions proposed for shape recognition by Taubin and Cooper [14]. The
illumination-invariant vectors can be computed efficiently in time proportional
to the number of pixels that define an image region under consideration.

The method for computing moment invariants of a distribution $H(p)$ begins by
transforming $H(p)$ to a normalized distribution $H(Lp)$ whose second order
centered moment matrix is the identity matrix. A Cholesky decomposition is
employed to compute $L$ from the second order centered moment matrix of $H(p)$.
If such a normalization is applied to distributions which were originally
related by a linear coordinate transformation, then the normalized
distributions will be related by an orthogonal coordinate transformation.
Eigenvalues of a moment matrix of the normalized distribution will be invariant
to the original linear coordinate transformation and equivalently the
illumination change. Distribution invariants have been used as a feature for
image database annotation [10] and for the illumination-invariant content-based
retrieval of regions in multispectral satellite images [5]. A full description
of the computation of these invariants is given in [6].
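
The normalization argument can be sketched in a few lines. This is not the
algorithm of [6] or [14]. In particular, the moment matrix
$B = E[\|q\|^2 q q^T]$ used below is an assumption of this sketch, chosen only
because it transforms as $OBO^T$ under an orthogonal change of coordinates; the
cited papers define their own moment matrices. The function names and the
synthetic data are hypothetical.

```python
import numpy as np

def normalize(pixels):
    """Whiten an (N, n) array of color pixels.

    Returns (q, L), where L is the Cholesky factor of the centered second
    order moment matrix and q = L^{-1}(p - mean) has identity covariance.
    """
    centered = pixels - pixels.mean(axis=0)
    cov = centered.T @ centered / len(pixels)
    L = np.linalg.cholesky(cov)
    q = np.linalg.solve(L, centered.T).T
    return q, L

def moment_invariants(pixels):
    """Eigenvalues of an assumed fourth-order moment matrix of the whitened data."""
    q, _ = normalize(pixels)
    w = (q ** 2).sum(axis=1)                   # ||q||^2 for every pixel
    B = (q * w[:, None]).T @ q / len(q)        # E[ ||q||^2 q q^T ]
    return np.linalg.eigvalsh(B)               # eigenvalues in ascending order

rng = np.random.default_rng(2)
# Skewed synthetic "pixels"; the gamma shape parameters are arbitrary.
p = rng.gamma(shape=[1.0, 2.0, 4.0], size=(5000, 3))
# Arbitrary nonsingular illumination change matrix.
M = np.array([[1.2, 0.3, 0.1], [0.2, 0.9, 0.4], [0.1, 0.2, 1.5]])
p_tilde = p @ M.T

print(moment_invariants(p))
print(moment_invariants(p_tilde))
```

The two printed vectors agree to numerical precision, because the whitened
versions of `p` and `p_tilde` differ only by an orthogonal transformation. The
cost is linear in the number of pixels, consistent with the complexity stated
above.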

3.2  Filtered  Distribution  Invariants 

Although distribution invariants have been demonstrated as useful features for
recognition, color distributions do not preserve information about the spatial
structure in a color image. Thus, distribution invariants cannot discriminate
regions with significantly different spatial properties if the regions have
similar color distributions. In experiments with a large database of
multispectral satellite images [5], for example, distribution invariants were
shown to be useful as a filter to reduce the set of candidate matches, but by
themselves lacked the discriminatory power for recognition in many cases.

Color images can be analyzed efficiently into components corresponding to
different spatial frequency characteristics using a bank of spatial filters.
This approach has been applied successfully by many systems for the analysis of
gray-scale images. The filter-based representation has the advantage that
filters can be selected to emphasize properties of interest which optimize
performance for particular applications. In recognition systems, filter-based
representations can provide effective features for database indexing.
Unfortunately, as shown in section 2, distributions in spatially filtered
images depend on the scene illumination.

From (10), we see that filtered color image distributions are related by a
linear coordinate transformation. Thus, the moment invariants described in 3.1
will provide illumination-invariant descriptors of distributions in filtered
color images. These filtered distribution invariants capture information about
spatial structure in an image according to the spatial response of the filter
used. Therefore, filtered distribution invariants can be combined with
distribution invariants for illumination-invariant recognition. While color
distributions are invariant to rotation, scale, and orientation in a scene
[12], geometric invariance does not necessarily hold for filtered
distributions. A change of scale, for example, alters the spatial frequencies
which are observed in an image and a rotation changes the orientation of
spatial features. Rotation invariance can be achieved, however, by selecting
filters which are rotationally invariant.

We illustrate the use of filtered distribution invariants in figures 1-7.
Figure 1 is a color image of a photograph of a tiger under each of four
illumination conditions. Moving clockwise from the upper left are images
obtained under white, red, yellow, and green illumination. Figure 2 displays
each of these images following a lowpass Gaussian filtering on each band.
Figure 3 displays each of these images following a highpass difference of
Gaussians filtering on each band. Figures 4-6 are corresponding images of
bushes. Figure 7(a) is a plot of the distribution invariants computed for each
of the four unfiltered tiger and bush images. Six invariants are computed for
each image but only the three largest are used for display purposes. Figure
7(b) plots the filtered distribution invariants for the lowpass images and
figure 7(c) plots the filtered distribution invariants for the highpass images.
In each case the invariants for the different objects form reasonably compact
separated clusters in the invariant spaces. Since each invariant space captures
different information, the use of different filters increases the amount of
discriminatory power that is available for recognition.

In figures 7-9, we demonstrate the use of filtered distribution invariants for
discriminating patterns with similar color distributions. Figure 8 is a single
color pattern under the four different illumination conditions (white, red,
yellow, green). Figure 9 is a similar color pattern also imaged under each of
the four different illumination conditions. The color distributions for the two
patterns are nearly identical and consequently the distribution invariants are
unable to discriminate the patterns. This is shown by the overlap of the ‘o’
(pattern 1 invariants for each illumination condition) and ‘+’ (pattern 2
invariants for each illumination condition) symbols for the unfiltered
distribution invariants in figure 7(d). However, invariants computed from a
highpass filtered version of the image regions in figures 8 and 9 separate well
as shown in figure 7(d). Therefore, filtered distribution invariants can be
used to recognize color patterns in the presence of illumination changes based
on subtle spatial differences even when the structure of the multispectral
distributions does not allow recognition.

4  Illumination  Correction 

From (6), a scene illumination change transforms measured color pixel vectors
according to multiplication by a matrix $M$. Given a database color pixel
distribution and an observed color pixel distribution, the matrix $M$ relating
the two distributions can be computed. This illumination correction matrix can
be used for several purposes. In [12], $M$ was used during recognition to
correct the illumination change relating a candidate matching region in an
image and a database region. This illumination correction step allowed a direct
comparison of candidate matching regions during hypothesis verification. The
illumination correction matrix can also be updated continuously using color
changes for a known object in the scene to correct images for scene
illumination changes. In 4.1 we describe a method for computing the
illumination correction matrix $M$. In 4.2 we introduce an illumination change
consistency constraint which can be used to verify a match hypothesized by a
combination of distribution and filtered distribution invariants.

4.1  Computing  Illumination  Changes 

The method for computing the illumination change matrix $M$ proceeds as
follows. Let $H(p)$ denote the color pixel distribution of a database region
and let $\tilde{H}(p)$ denote the color pixel distribution of an observed image
region. If the two regions correspond to the same surface area under different
illuminants, then $\tilde{H}(Mp) = H(p)$ as in (11). Transforming the original
distributions by the matrices $L$ and $\tilde{L}$ described in 3.1 will
generate normalized distributions $G(p) = H(Lp)$ and
$\tilde{G}(p) = \tilde{H}(\tilde{L}p)$ which are related by an orthogonal
coordinate transformation

$$G(p) = \tilde{G}(Op) \tag{12}$$

Figure 1: Tiger under four illumination conditions

Figure 2: Lowpass filtered tiger images

Figure 3: Highpass filtered tiger images

Figure 4: Bushes under four illumination conditions

Figure 5: Lowpass filtered bush images

Figure 6: Highpass filtered bush images

Figure 7: (a) Invariants for unfiltered images ‘o’ - tiger, ‘+’ - bushes,
(b) Invariants for lowpass images ‘o’ - tiger, ‘+’ - bushes, (c) Invariants for
highpass images ‘o’ - tiger, ‘+’ - bushes, (d) Computed Invariants
‘o’ - pattern 1, ‘+’ - pattern 2

Figure 8: Pattern 1 under four illumination conditions

Figure 9: Pattern 2 under four illumination conditions

The matrix $O$ is determined by finding an intrinsic coordinate system for the
distributions $G(p)$ and $\tilde{G}(p)$ and then computing the rotation which
aligns these systems. The simplest intrinsic coordinate system is defined by
the eigenvectors of a distribution's second order moment matrix. This
coordinate system cannot be used for alignment in this case, however, because
the second order moment matrices of $G(p)$ and $\tilde{G}(p)$ are identity
matrices following the normalization by $L$ and $\tilde{L}$ [14]. The intrinsic
coordinate system for each distribution is defined instead using another
symmetric $3 \times 3$ matrix $B$ which is formed from distribution moments and
defined in [11].

The matrices $B$ and $\tilde{B}$ are computed from $G(p)$ and $\tilde{G}(p)$
respectively. These matrices determine intrinsic coordinate systems for the
respective distributions and can be aligned to determine the matrix $O$. Let
$a_1, a_2, a_3$ be orthonormal eigenvectors of $B$ sorted by eigenvalue
magnitude and let $\tilde{a}_1, \tilde{a}_2, \tilde{a}_3$ be similarly sorted
orthonormal eigenvectors of $\tilde{B}$. Let $C$ be the $3 \times 3$ matrix
whose $i$th row is given by $a_i$ and let $\tilde{C}$ be the corresponding
matrix of $\tilde{a}_i$'s. Then

$$O = \tilde{C}C^{-1} \tag{13}$$

is the matrix which relates $G(p)$ and $\tilde{G}(p)$ and

$$M = \tilde{L}\tilde{C}C^{-1}L^{-1} \tag{14}$$

is the illumination change matrix which relates $H(p)$ and $\tilde{H}(p)$.
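
The alignment procedure can be sketched as follows, under several assumptions
that are not taken from the paper. The matrix $B$ is replaced by the assumed
fourth-order moment matrix $E[\|q\|^2 q q^T]$ rather than the matrix defined in
[11]. Eigenvectors are stored as columns, so the frame matrices here are the
transposes of the row-wise matrices in the text. Because an eigenvector is
defined only up to sign, each one is oriented so that the third moment of the
data along it is positive; this sign rule is a choice of the sketch. All names
and data are hypothetical.

```python
import numpy as np

def intrinsic_frame(pixels):
    """Return (L, C): Cholesky factor and intrinsic frame of the whitened data.

    C holds eigenvectors of the assumed moment matrix B = E[||q||^2 q q^T]
    as columns, sorted by eigenvalue, each with positive third moment.
    """
    centered = pixels - pixels.mean(axis=0)
    L = np.linalg.cholesky(centered.T @ centered / len(pixels))
    q = np.linalg.solve(L, centered.T).T        # whitened pixels
    w = (q ** 2).sum(axis=1)
    B = (q * w[:, None]).T @ q / len(q)
    _, C = np.linalg.eigh(B)                    # columns sorted by eigenvalue
    skew = ((q @ C) ** 3).mean(axis=0)          # third moment along each axis
    C = C * np.where(skew < 0, -1.0, 1.0)       # flip axes with negative skew
    return L, C

def illumination_change(db_pixels, obs_pixels):
    """Estimate M such that obs = M db, following the structure of (13), (14)."""
    L, C = intrinsic_frame(db_pixels)
    L_t, C_t = intrinsic_frame(obs_pixels)
    O = C_t @ C.T                               # C is orthogonal: C^{-1} = C^T
    return L_t @ O @ np.linalg.inv(L)

rng = np.random.default_rng(3)
mix = np.array([[1.0, 0.4, 0.2], [0.3, 1.0, 0.5], [0.2, 0.1, 1.0]])
db = rng.gamma(shape=[1.0, 2.0, 4.0], size=(5000, 3)) @ mix   # synthetic region
M_true = np.array([[1.2, 0.3, 0.1], [0.2, 0.9, 0.4], [0.1, 0.2, 1.5]])
obs = db @ M_true.T                             # same region, new illuminant

M_est = illumination_change(db, obs)
print(np.round(M_est, 3))
```

With noise-free synthetic data the estimate reproduces `M_true`. The sign rule
fails for distributions that are symmetric along an intrinsic axis, and the
frame is ill-defined when two eigenvalues of the moment matrix coincide, so
real data may need a more careful alignment step.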

4.2  Illumination  Change  Consistency 

As described in section 3, a match of distribution invariants combined with
matches of one or more sets of filtered distribution invariants provides strong
evidence for illumination-invariant recognition. In this subsection, we
introduce an illumination change consistency constraint which can be used to
further verify a hypothesized match based on more than one vector of
invariants. Consider a surface viewed under two different illumination
conditions. From (10) and (11), we see that the illumination change matrix
relating the color distributions and any filtered distributions is the same
matrix $M$. Therefore, for a hypothesized match based on distribution
invariants and filtered distribution invariants, the matrices $M$ relating each
pair of corresponding distributions will be the same. This constraint can be
applied by evaluating the similarity among the set of illumination correction
matrices $M_1, M_2, \ldots, M_N$ which are computed using the method in 4.1 for
each of the $N$ distributions associated with an image region. In this work, we
consider the distance defined by

$$D = \sum_{k,l} \max_{i,j} \left\{ \left( m_i(k,l) - m_j(k,l) \right)^2 \right\} \tag{15}$$

where $m_i(k,l)$ is the element at row $k$, column $l$ of matrix $M_i$. Thus,
$D$ measures the similarity among a set of matrices by evaluating the sum of
the maximum square differences between corresponding matrix elements. The
illumination change consistency constraint will be examined experimentally in
the next section.
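
The distance described above, the sum over element positions of the largest
squared difference between any two matrices, is straightforward to compute. The
sketch below is illustrative only; the function name and example matrices are
made up. It uses the fact that, at a fixed position, the largest squared
difference over all pairs equals the squared range of the values at that
position.

```python
import numpy as np

def consistency_distance(matrices):
    """Sum over positions (k, l) of the largest squared pairwise difference.

    `matrices` is a sequence of equally sized illumination change matrices,
    one per distribution (unfiltered, lowpass, highpass, ...).
    """
    stack = np.stack([np.asarray(m, dtype=float) for m in matrices])
    # max over pairs (i, j) of (m_i - m_j)^2 equals (max_i m_i - min_i m_i)^2
    spread = stack.max(axis=0) - stack.min(axis=0)
    return float((spread ** 2).sum())

M1 = np.eye(3)
M2 = np.eye(3)
M3 = np.eye(3)
M3[0, 1] = 0.5                                # one element disagrees by 0.5

print(consistency_distance([M1, M2]))         # identical matrices
print(consistency_distance([M1, M2, M3]))     # one inconsistent matrix
```

The two calls print 0.0 and 0.25. A correct match should give a small distance,
since all of its matrices estimate the same illumination change.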

5  Experimental  Results 

We tested the filtered distribution invariants on a database of sixteen objects
imaged under different illumination conditions. The images were taken from
existing distribution and texture databases [6] [7]. The database consisted of
an image of each object taken under nearly white illumination generated by a
Newport model 765 tungsten-halogen light source filtered by a Corion BG-38
blue-green filter. A set of forty-eight test images was obtained by imaging
each database object under each of yellow, red, and green illumination. The
images were captured at our image acquisition facility using a monochrome Sony
XC-77 CCD camera in conjunction with three Corion filters having the spectral
sensitivity bands: CA-600 (580nm-700nm), CA-550 (500nm-600nm), CA-500
(400nm-520nm) and a RasterOps TC-PIP framegrabber. The imaging system was
configured for a linear response which has been verified using patches of known
reflectance and neutral filters of known transmission. The images shown in
figures 1, 4, 8, and 9 were included in the experiments. Figures 10 and 11 show
two of the other images. In each case, the database image obtained under white
illumination is shown in the upper left. The test image under red illumination
is shown in the upper right, yellow illumination in the lower right, and green
illumination in the lower left. Each of the underlying objects is a photograph
which was imaged under the four different illuminants.

The invariants used for representing each image in the experiments were
computed from 1) the image color distribution $H(p)$, 2) the color distribution
in a lowpass filtered image $H_l(p)$ and 3) the color distribution in a
highpass filtered image $H_h(p)$. A 2-D Gaussian with $\sigma = 1.5$ pixels was
used for the lowpass filter and a 2-D difference of Gaussians with
$\sigma_1 = 1.5$ pixels and $\sigma_2 = 0.75$ pixels was used for the highpass
filter. Each filter was represented using a $9 \times 9$ mask. A vector of six
invariants was computed for each of the original and filtered images yielding a
total of eighteen distribution and filtered distribution invariants for each
image. Recognition experiments were conducted in which each of the forty-eight
test images was classified as an instance of the database image with which it
had the smallest Euclidean distance in invariant space. First, we considered
each invariant space separately so that a 6-dimensional vector was used for
indexing. The correct classification rates using each of the individual
invariant spaces were: unfiltered space (40/48), lowpass space (35/48),
highpass space (35/48). Second, we combined all three invariant vectors for
each image. The invariant vectors were compared using the Euclidean distance
between 18-dimensional concatenated vectors. Using the combination of the three
invariant vectors for each object yielded a correct classification rate of
47/48. The only test image which was classified incorrectly was texture 1 under
red illumination (figure 10) which was classified as an instance of texture 2
(figure 11). These textures have similar color and spatial properties. We note
that recognition accuracy is significantly improved by combining invariants
corresponding to color distribution and different filtered distributions.
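
The classification rule used in these experiments is a nearest-neighbor search
in invariant space. A minimal sketch follows; the invariant vectors are random
placeholders rather than values computed from images, and the function name is
hypothetical.

```python
import numpy as np

def classify(test_vectors, database_vectors):
    """Nearest database vector (Euclidean distance) for each test vector.

    Each row concatenates the invariant vectors for one image, for example
    6 unfiltered + 6 lowpass + 6 highpass = 18 values.
    Returns (labels, dist) where dist has shape (num_test, num_database).
    """
    diff = test_vectors[:, None, :] - database_vectors[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return dist.argmin(axis=1), dist

# Toy data: 3 database objects with 18-dimensional concatenated invariants.
rng = np.random.default_rng(4)
database = rng.random((3, 18))
# Test vectors are slightly perturbed copies of database entries 2, 0, 1.
tests = database[[2, 0, 1]] + 0.01 * rng.standard_normal((3, 18))

labels, dist = classify(tests, database)
print(labels)
```

Classifying in a single 6-dimensional space corresponds to passing only the
relevant six columns of each array to the same function.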

We examined the use of the illumination change consistency constraint derived
in 4.2 for verifying matches when the vector of 18 invariants generated more
than one close match. All invariant distances between a test image and a
database image which were within 0.10 of the best match were used to signify a
candidate match. Using this criterion, six of the forty-eight test images had
more than one candidate match in the database. One of these six test images was
the single misclassified image using the combined invariant vector distance.
For each of these images, the distance $D$ of (15) was computed for each
candidate match. The candidate match with the smallest $D$ was classified as
the database match. Using this procedure, each of the six test images with more
than one candidate match was classified correctly. Therefore, using the
invariant distance to generate candidate matches and the illumination change
consistency constraint for verification leads to an overall correct
classification rate of 48/48.
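
The two-stage procedure can be sketched as below. The sketch assumes that
"within 0.10 of the best match" means an absolute difference in invariant-space
distance, which is one reading of the text. The distances and consistency
values in the example are invented for illustration, and the function names are
hypothetical.

```python
def candidate_matches(distances, margin=0.10):
    """Database indices whose invariant distance is within `margin` of the best."""
    best = min(distances)
    return [i for i, d in enumerate(distances) if d - best <= margin]

def verify(distances, consistency, margin=0.10):
    """Pick the final database match for one test image.

    distances:   invariant-space distance to every database image
    consistency: callable mapping a database index to its distance D of (15)
    """
    candidates = candidate_matches(distances, margin)
    if len(candidates) == 1:
        return candidates[0]          # unambiguous, no verification needed
    return min(candidates, key=consistency)

# Made-up numbers: entries 0 and 2 are both close in invariant space.
distances = [0.31, 0.90, 0.25, 1.40]
d_values = {0: 0.02, 2: 0.85}         # hypothetical D for each candidate

print(candidate_matches(distances))
print(verify(distances, d_values.get))
```

In this toy case the nearest neighbor is entry 2, but entry 0 has the more
consistent set of illumination change matrices and is selected instead, which
mirrors how the single misclassified test image was corrected.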

Using this set of images, we also studied the behavior of the illumination
invariants with respect to the spatial filtering operations. For this purpose,
we define the dispersion $d$ of a set of invariant vectors
$V_1, V_2, \ldots, V_P$ by

$$d = \frac{1}{P} \sum_{1 \le i \le P} \left\| V_i - \bar{V} \right\|^2 \tag{16}$$

where $\bar{V}$ is the mean vector of $V_1, V_2, \ldots, V_P$. We computed $d$
for each set of four distribution invariant vectors corresponding to a single
object under the different illumination conditions. Thus, for perfect
illumination invariance $d$ would be zero for each object. We also computed $d$
for each set of four invariant vectors corresponding to the lowpass filtered
versions of a single object under the different illumination conditions.
Similarly, we computed $d$ for each set of four invariant vectors corresponding
to the highpass filtered versions of each object. The average dispersion values
over the 16 objects were: raw images $d_r = 0.322$, lowpass images
$d_l = 0.377$, highpass images $d_h = 0.647$. The larger dispersion value for
the highpass images can be attributed to the fact that a significant fraction
of the pixels in the highpass images have values near zero which have little
effect on the distribution moments. Thus, for the highpass images smaller
distribution samples contribute to the moment computation leading to greater
dispersion of the invariants.
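
A short sketch of the dispersion computation follows. It assumes that $d$ is
the mean squared Euclidean distance of the vectors from their mean; if the
intended definition uses the unsquared norm, the final line of the function
would change accordingly. The function name and example vectors are
hypothetical.

```python
import numpy as np

def dispersion(vectors):
    """Mean squared Euclidean distance of a set of vectors from their mean."""
    v = np.asarray(vectors, dtype=float)
    return float(((v - v.mean(axis=0)) ** 2).sum(axis=1).mean())

print(dispersion([[1.0, 2.0], [1.0, 2.0]]))    # identical vectors
print(dispersion([[0.0, 0.0], [2.0, 0.0]]))    # each at distance 1 from the mean
```

The two calls print 0.0 and 1.0. In the experiments each set would contain the
four 6-dimensional invariant vectors of one object, one per illuminant.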

6  Discussion 

The spatial structure invariants introduced in this paper have many
applications. These invariants allow for the efficient illumination-invariant
comparison of regions following an arbitrary linear filtering operation.
Therefore, these invariants can be used for recognizing a wide range of
deterministic and random patterns in color images. As demonstrated in section
5, incorporating spatial information by combining invariants of filtered color
distributions $H'(p)$ with invariants of the image color distribution $H(p)$
can significantly improve the performance of recognition. An important issue
for future work is the selection of the set of spatial filters $g(x,y)$ that is
used to generate the filtered color images $p'(x,y)$ that are used to compute
invariants. Filters can be designed that extract a wide variety of spatial
properties of an input image. General considerations suggest the use of a set
of filters that provides a compact representation that is descriptive enough to
enable accurate recognition. For particular recognition problems, optimized
filter sets can be designed that maximize object discriminability while using a
total number of invariants that is as small as possible.

Figure 10: Texture 1

Figure 11: Texture 2

Abstract

Template matching is an effective but computationally expensive means of
locating vehicles in imagery. To reduce processing time, we use neural networks
to predict a small subset of templates most likely to match each image chip.
Results on actual LADAR images show that limiting the templates to those
selected by the network reduces the computation time by a factor of 5 without
sacrificing identification accuracy.

1  Introduction 

Neural networks are often used to extract complex, nonlinear relationships from
a data set. In this paper, we use this nonlinear mapping to reduce the
computation associated with template matching. Template matching requires the
application of numerous templates to rectangular image chips [Perlovsky et al.,
1995; Katz et al., 1993]. Each template corresponds to a particular type of
vehicle at a specific orientation. After all templates are applied, the best
matched templates are returned as the most likely vehicles present [Bevington,
1992].

In order to detect a wide range of vehicle types, a large number of templates
must be applied. For this method to be feasible, computation must be fairly
efficient. Unfortunately, computation time is usually O(n) where n is the
number of templates. Two methods are typically used for reducing processing
time: parallel hardware is added [Fang et al., 1987], or focus-of-attention
mechanisms reduce the number of image chips [Beveridge et al., 1994a].

* This work was sponsored by the Defense Advanced Research Projects Agency
(DARPA) Image Understanding Program under grants DAAH04-93-G-422 and
DAAH04-95-1-0447, monitored by the U.S. Army Research Office.

A different approach is to index a subset of the possible templates to apply.
Here we describe how neural networks can be used to predict the utility of
applying each template to a given image chip. Then only those templates
expected to match well need to be applied. We show that our neural network
indexing reduces the computation time by a factor of 5 without sacrificing
identification accuracy on a set of real images. A more detailed explanation
can be found in [Stevens et al., 1997].

2 Neural Networks and Template Probing

Our neural network indexing method is based on template probing. Template
probing is a correlation technique which matches stored templates to image
chips. Each template represents the vehicle at a specific location, and
consists of a set of probe test points for locating depth discontinuities along
the object contour. These templates are derived off-line from rendered images
of 3D CAD models at specific locations and orientations in the image. From
these renderings a co-occurrence matrix is derived.

A co-occurrence matrix is a sparse, binary matrix of size n x n where n is the
number of pixels in the rendered image. Matrix elements are 1 where the probe
test points have a difference in depth greater than some threshold. When a new
image chip is input, a co-occurrence matrix is formed and exhaustively compared
to all the templates. The ratio of similarities to total comparisons gives a
quality of match score.
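
One way to read this description is sketched below. The details are
assumptions: the sparse binary matrix is stored as a set of pixel-index pairs,
the probe test points are given as an explicit list of pairs, and a
"similarity" is counted whenever chip and template agree on a pair. The actual
template probing implementation may differ, and all names and numbers are
hypothetical.

```python
def cooccurrence(depth, pairs, threshold):
    """Sparse binary co-occurrence matrix stored as a set of index pairs.

    depth:  flat list of range values, one per pixel
    pairs:  probe test points given as (i, j) pixel-index pairs
    An entry (i, j) is 1, i.e. present in the set, when the depth difference
    between pixels i and j exceeds `threshold`.
    """
    return {(i, j) for i, j in pairs if abs(depth[i] - depth[j]) > threshold}

def match_score(chip_entries, template_entries, pairs):
    """Fraction of probe pairs on which chip and template agree (assumed score)."""
    agree = sum((p in chip_entries) == (p in template_entries) for p in pairs)
    return agree / len(pairs)

# Toy 4-pixel chip: pixels 0 and 1 are near (vehicle), 2 and 3 are far.
depth = [10.0, 10.2, 25.0, 24.8]
pairs = [(0, 1), (1, 2), (2, 3), (0, 3)]
chip = cooccurrence(depth, pairs, threshold=5.0)
template = {(1, 2), (0, 3)}           # hypothetical stored template

print(sorted(chip))
print(match_score(chip, template, pairs))
```

Here the chip shows depth discontinuities on exactly the pairs the template
expects, so the score is 1.0. Exhaustive probing repeats this comparison for
every stored template.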

Instead of the exhaustive comparison, we use a neural network to predict the 25
best templates (out of 3,800) to apply to each chip. Processing time is greatly
reduced at the expense of adding the small amount of internal network
calculations. Each network (one per vehicle) receives the sparse co-occurrence
matrix as input. The outputs are predictions of the degree of match between the
matrix and each template. Rather than requiring the networks to learn accurate
predictions, we only assume they are correct in relative ranking.
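
Because only the ranking of the predictions is used, template selection reduces
to a top-k operation on the network outputs. A minimal sketch with made-up
scores follows; the function name is hypothetical.

```python
def select_templates(predicted_scores, k=25):
    """Indices of the k templates with the highest predicted match score.

    Only the relative ranking of the predictions matters, not their values.
    """
    order = sorted(range(len(predicted_scores)),
                   key=lambda t: predicted_scores[t], reverse=True)
    return order[:k]

# Toy example: 6 templates, keep the best 2.
scores = [0.10, 0.80, 0.35, 0.95, 0.20, 0.50]
print(select_templates(scores, k=2))

# Any monotone rescaling of the predictions selects the same templates.
print(select_templates([s ** 3 for s in scores], k=2))
```

Both calls print `[3, 1]`, which illustrates why the networks need to be right
about order rather than about the predicted match values themselves.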

Each network is trained with standard error back propagation [Rumelhart et al.,
1986]. The training data was generated from 250 synthetic renderings of the
vehicles (3 were used here) placed in the scene at random orientations. For
each chip in these images, the co-occurrence matrix is used with the exhaustive
technique to generate the match scores. The training set is formed by pairing
each matrix with the match scores. The best network topology and learning
parameters were found empirically by training the system and observing
performance. We found that a five unit hidden layer with a learning rate of
0.0001 and momentum of 0.1 worked well.

3  Results  on  Actual  LADAR  Data 

As described in the previous section, the networks were trained with synthetic
imagery. The networks were then tested on 15 real LADAR images¹ from the Fort
Carson data set.

When the exhaustive template probing was used, it found the correct vehicle in
11 of the 15 images examined. We then looked at the number of times the network
prediction failed to predict the top 25 correct templates. Out of the 375 (25
templates x 15 images) templates, the network prediction failed only 14 times
for a success rate of about 96%, while still achieving the same identification
rate.

¹ The Fort Carson data collection is publicly available at:
http://www.cs.colostate.edu/~vision

The neural network selection of salient templates decreases processing time.
Figure 1 shows the run-times for all chips examined in the 15 images. Figure 1a
shows the cumulative time, and Figure 1b shows the time in seconds for each
window using the two approaches. The experiments were run simultaneously on a
Sparc 20. The drop in time per window for the exhaustive method corresponds to
the completion of the network indexing method.

Figure 1: Execution times. (a) Cumulative time. (b) Time per window.

Abstract

Evaluation of large, complex research and development programs requires a new
and more structured approach. The approach developed and used here is an
evaluation framework whose evaluation elements are based on detailed
consensus-based evaluation plans. Implementation of this approach is
illustrated in a description of the evaluations designed and performed by the
authors for the Defense Advanced Research Projects Agency Unmanned Ground
Vehicle (DARPA/UGV) program. Current data is presented on the evaluations of
the UMASS color detection and the Hughes FLIR detection algorithms for the Ft.
Carson collection.

Implementation of the evaluation approach for UGV/RSTA¹

Background 

The RSTA module of the UGV program was developed around the assumption of the
use of three sensor modalities: Color CCD, 8-14 micron FLIR, and LADAR. The
operational concept prescribed a two stage approach, detection followed by
identification. The approach was to encompass many performers who were to
develop algorithms for these specific sensors or for a combination of sensors.
In general, the detection would be performed by either a Color CCD sensor or a
FLIR. The detection result would be passed to a different algorithm that would
use the FLIR or LADAR sensors, either singly or in combination, to perform
identification.

¹ This work was sponsored by the Defense Advanced Research Projects Agency
under contract number DAAH01-93-CR113 and monitored by the U.S. Army Missile
Command. It is an SBIR effort.

The  initial  sensor  design  had  fixed  a  number  of 
the  important  characteristics  and  operational  con¬ 
ditions  in  a  manner  that  had  a  substantial  impact 
on  the  program  and  ultimately  the  evaluations. 
The  most  critical  of  these  decisions  was  to  de¬ 
velop  a  LADAR  sensor  specifically  for  this  pro¬ 
gram.  The  sensor  was  not  built  in  time  to  provide 
any  training,  testing  or  evaluation  imagery, 
thereby  making  it  impossible  to  perform  any  for¬ 
mal  evaluation  of  algorithms  based  on  LADAR 
imagery.  The  second  critical  design  decision 
concerned  the  focal  lengths  of  the  lenses  for  the 
Color  and  FLIR  sensors.  The  design  philosophy 
was  to  use  a  wide  field  of  view  (11.11  degrees) 
lens  for  detection  and  a  narrow  field  of  view  lens 
(1.23  degrees)  for  identification.  These  design 
constraints, with the focal plane pixel size, specific targets, and the operational ranges to target, resulted in a design that produced a specific pixels on target (POT) versus range-to-target (RTT) relationship (see Figure 1). At the anticipated maximum range-to-target of 5,000 meters, there would only be 3 pixels on target for detection. It is within this framework that the algorithms had to be evaluated.

Figure 1 RSTA POT vs. RTT
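
The way pixels on target fall off with range can be illustrated with a minimal geometric sketch. It assumes a small-angle model with square pixels; the detector width (640 pixels) and the target silhouette (6 m by 3 m) are hypothetical values chosen only for illustration and are not taken from the program's sensor design. Only the two field-of-view angles come from the text.

```python
import math


def pixels_on_target(range_m, fov_deg, pixels_across, target_w_m, target_h_m):
    """Approximate pixels on target (POT) for a flat target facing the sensor.

    Small-angle model with square pixels: each pixel covers roughly
    range * IFOV metres at the target, so POT scales as 1 / range**2.
    """
    ifov_rad = math.radians(fov_deg) / pixels_across  # angle subtended by one pixel
    footprint_m = range_m * ifov_rad                  # metres per pixel at this range
    return (target_w_m / footprint_m) * (target_h_m / footprint_m)


WIDE_FOV_DEG = 11.11    # wide field of view quoted in the text
NARROW_FOV_DEG = 1.23   # narrow field of view quoted in the text
PIXELS_ACROSS = 640     # hypothetical detector width (assumption)
TARGET_W_M = 6.0        # hypothetical target width (assumption)
TARGET_H_M = 3.0        # hypothetical target height (assumption)

for range_m in (500, 1000, 2000, 5000):
    pot = pixels_on_target(range_m, WIDE_FOV_DEG, PIXELS_ACROSS, TARGET_W_M, TARGET_H_M)
    print(f"{range_m:5d} m: {pot:9.1f} pixels on target (wide field of view)")
```

Under this model, doubling the range cuts the pixels on target by a factor of four, and switching from the wide to the narrow lens multiplies them by the square of the field-of-view ratio. The absolute numbers depend entirely on the assumed detector and target sizes.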

Developer  Algorithm  descriptions 

The  following  sections  describe  the  major  types  of 
evaluation  experiments  in  which  performers  par¬ 
ticipated,  the  technical  approach,  and  the  partition 
of  mission  requirements  addressed. 

Color  CCD  detection 

Only  one  performer  produced  a  color  CCD  detec¬ 
tion  algorithm  for  UGV/RSTA.  This  algorithm 
was  developed  by  the  Colorado  State  University/ 
University  of  Massachusetts  team.  The  algorithm 
identifies  targets  based  on  the  differences  in  hue 
between  targets  and  background.  Since  the  sensor 
operates  in  the  visible  portion  of  the  spectrum, 
detections  using  this  algorithm  could  be  carried 
out  during  the  day  only,  even  though  mission  per¬ 
formance  requirements  specify  both  day  and  night 
detection.  Thus  a  major  mission  performance 
shortfall  was  identified  prior  to  any  “actual” 
evaluations. 

FLIR detection

Several  performers  developed  algorithms  using 
FLIR  imagery  to  perform  detection  including 
Lockheed-Martin  Marietta,  Hughes,  and  Honey¬ 
well.  Their  FLIR  algorithms  were  designed  to 
take  as  input  a  full-frame,  wide-field-of-view  im¬ 
age.  The  algorithms  then  detected  potential  targets 
and  passed  the  locations  of  these  targets  and  small 
image  chips  containing  these  targets  to  the  next 
processing  stage.  Currently,  only  the  Hughes  al¬ 
gorithm  has  been  evaluated. 

Color  CCD  Evaluation 

Objectives 

There were two primary objectives for the evaluation of the University of Massachusetts (UMASS)/Colorado State University CCD detection algorithm: (1) to determine the range of user parameters where the algorithm exceeded the user requirements and (2) to determine the performance of the algorithm over the selected data set in terms of probability of detection and false alarm rate. Ancillary objectives were to provide an analysis of performance to the program manager and the developer to support identification of specific problems with either the algorithm or the data.

Implementation  details 

The  evaluation  of  the  UMASS/Colorado  State 
University  color  detection  algorithm  was  per¬ 
formed  at  the  University  of  Massachusetts  under 
the  supervision  of  the  evaluation  team.  University 
of  Massachusetts  staff  supporting  the  conduct  of 
the  evaluation  had  no  previous  knowledge  of  the 
specific  images  selected  for  the  evaluation.  The 
evaluation  design  matrix  contains  both  seques¬ 
tered  and  non-sequestered  imagery. 

The  sequestered  imagery  is  from  the  Martin  Mari¬ 
etta  September  94  collection  (Rimey,  1995)  and 
consists  of  approximately  111  unique  image 
scenes.  There  were  199  target  opportunities  in  the 
unique image scenes. Each unique image scene in this data set had 10 or more replicate images.^2 Therefore, in the sequestered evaluation sample, there were 1990 target opportunities.

The  entire  Ft.  Carson  set  (Beveridge,  1994)  had 
been  released  to  the  developers  early  in  the  pro¬ 
gram  for  algorithm  development  and  therefore  was 
not  sequestered.  The  Ft.  Carson  collection  con¬ 
sists  of  scanned  color  film,  with  37  unique  image 
scenes  with  56  targets.  There  were  no  replicate 
images  for  this  collection. 

The  non-sequestered  Ft.  Carson  data  set  was  used 
in  the  evaluation  because  it  contains  image  acqui¬ 
sition  conditions  significantly  different  from  those 
in  the  Martin  Marietta  collection.  It  therefore  of¬ 
fered  an  opportunity  for  conducting  an  evaluation 
over  conditions  closer  to  the  mission  requirements 
than  would  have  been  possible  had  we  used  only 

^2 Replicate images are “duplicates” of the unique image scenes, captured at the frame rate of the camera (e.g., 1/30 sec). For this collection, 10 replicate images were captured for each unique image scene. The images should be identical except for variations caused by system noise and camera vibration.


the sequestered images from the Martin Marietta 94 collection. We believed that this opportunity for breadth in evaluation outweighed the fact that developers had had the opportunity to train on the test set - not usually a desirable evaluation condition.

One  significant  difference  between  the  Ft.  Carson 
and  the  Martin  Marietta  collections  is  that  the  Ft. 
Carson  data  was  obtained  from  digitally  scanning 
film  images,  while  the  Martin  Marietta  collection 
captured  the  images  directly  from  a  digital  CCD 
camera. The scanned film images were processed and reduced in resolution by a factor of 4 using the Kodak Photo CD process (this process first applies a spatial low-pass filter to the image before sub-sampling, to reduce aliasing) to 490 by 760 pixels.
The  film  images  had  variations  in  exposure  and 
the  best  exposure  images  available  were  chosen 
for  the  evaluation.  The  CCD  images  were  col¬ 
lected  with  an  automatic  “white  balance”  control 
operating.  This  control  normalizes  the  gain  of 
each  color  to  provide  an  overall  “best  exposure.” 
However  this  “white  balance,”  because  it  adjusts 
each  color  on  an  image-by-image  basis  based  on 
the  scene  content,  destroys  the  absolute  color  in¬ 
formation.  In  images  where  there  is  a  large 
amount  of  sky  in  the  image,  there  is  a  noticeable 
color  shift  in  the  background. 
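
The loss of absolute color under automatic white balance can be illustrated with a small sketch. It assumes a simple "gray-world" normalization (each channel is scaled so that its scene mean equals the overall mean); the actual control law of the camera used in the collection is not described in the source, and the target, grass, and sky colors below are hypothetical.

```python
def mean_rgb(pixels):
    """Per-channel mean of a list of (r, g, b) tuples."""
    count = len(pixels)
    return tuple(sum(p[c] for p in pixels) / count for c in range(3))


def gray_world_balance(pixels):
    """Scale each channel so its scene mean equals the overall gray level.

    This is an assumed stand-in for the camera's automatic white balance.
    """
    means = mean_rgb(pixels)
    gray = sum(means) / 3.0
    gains = tuple(gray / m for m in means)
    return [tuple(p[c] * gains[c] for c in range(3)) for p in pixels]


# Hypothetical colors, for illustration only.
TARGET = (90.0, 100.0, 70.0)
GRASS = (80.0, 140.0, 60.0)
SKY = (120.0, 160.0, 230.0)

# The same target appears in a scene without sky and in a scene that is half sky.
scene_no_sky = [TARGET] + [GRASS] * 99
scene_half_sky = [TARGET] + [GRASS] * 49 + [SKY] * 50

target_no_sky = gray_world_balance(scene_no_sky)[0]
target_half_sky = gray_world_balance(scene_half_sky)[0]

print("target after balancing, no sky:  ", [round(v, 1) for v in target_no_sky])
print("target after balancing, half sky:", [round(v, 1) for v in target_half_sky])
```

The same physical target color is mapped to two different recorded colors because the per-channel gains follow the scene content. In this sketch the blue gain drops when a large sky region raises the blue mean, which is consistent with the background color shift noted for images containing a large amount of sky.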

No additional development on the version of the algorithm used for evaluation was conducted by the UMASS/Colorado State University team after November 1995. For the evaluation, the algorithm
was  to  have  two  sets  of  operating  characteristics, 
one  developed  for  each  of  the  two  collections. 
The  algorithm  would  use  one  set  of  parameters  for 
all  of  the  acquisitions  from  each  collection.  In 
this  manner,  the  evaluation  would  characterize  the 
variance  in  the  algorithm’s  ability  to  perform  over 
different  days  and  over  different  scenes  within  a 
restricted  time  and  location  (6  days  at  Ft.  Carson, 
4  days  at  the  Martin  Marietta  collections).  Infor¬ 
mation  was  passed  to  the  algorithm  that  identified 
the  collection  set,  but  no  additional  information, 
(e.g.,  illumination  conditions,  color  normalization, 
target  type,  or  range-to-target)  was  used  by  the 
algorithm. 


Conditions  of  evaluation 
Target  set 

For  the  best  evaluation  results,  targets  should  be 
equally  distributed  over  the  target  types  and  target 
orientation.  (The  design  assumes  a  stratified  sam¬ 
ple).  This  was  not  the  case  for  this  sample,  even 
after  both  collections  were  combined.  The  distri¬ 
bution  of  targets  and  their  orientations  is  not  uni¬ 
form  -  almost  half  as  many  targets  are  oriented 
front/rear  as  are  diagonal  or  broadside.  Thus,  for 
any  given  range,  there  will  be  more  POT  for  the 
set  of  targets  than  if  the  poses  had  been  uniformly 
distributed.  This  should  make  detection  easier 
than  it  would  be  in  an  entirely  random  selection. 
In addition, there are only about half as many targets of the M543A2 and M113-901 types as there should be (based on a uniform distribution of targets).

Image  set 

A summary of the image data set characteristics that were chosen for evaluation is given in Table 1. The pixels on target for the Ft. Carson collection ranged from 500 to 4000, while the range for
the  Martin  Marietta  collection  was  30  to  4000.  In 
addition  the  Ft.  Carson  collection  was  done  at 
very  low  ranges  to  target  (50  to  180  m)  while  the 
Martin  Marietta  collection  was  at  substantially 
greater  ranges  to  target  (500  to  1600m).  The 
range-to-target  is  significantly  greater  in  the 
Martin  Marietta  collection  and  much  more 
comparable  to  the  expected  operating  condi¬ 
tions. 

Though  the  maximum  pixels  on  target  is  approxi¬ 
mately  the  same  for  each  collection  (4000),  the 
images  in  the  Martin  Marietta  collection  have  a 
much  greater  range  of  pixels  on  target  available, 
mainly  as  a  result  of  using  an  additional  wide 
field-of-view  lens.  Any  atmospheric  effects  that 
would affect the color would more than likely be
seen  in  this  image  set. 

Table  1  shows  both  the  image-science  oriented 
and  the  user-oriented  (i.e.,  subjective  scale  data) 
descriptive  data  on  the  images.  All  of  the  imagery 
was  rated  as  having  high  signal-to-noise  (no  noise 
visible in the images). All of the images in the Ft. Carson imagery were judged to have high visibility and low environmental clutter, while the Martin Marietta collection images were judged to vary substantially on both dimensions.

Evaluation  parameters  and  ranges 

Target  scene  and  sensor 

The  targets  that  were  used  in  this  evaluation  were 
painted  with  standard  army  camouflage  paint.  All  of 
the  targets  were  stationary  during  imaging.  The  il¬ 
lumination  conditions  during  the  collections  were 
day  only  and  the  weather  was  generally  clear. 

The  background  was  generally  treeless,  green  grassy 
fields  with  slow  rolling  hills  at  Ft.  Carson.  There 
were  very  few  bushes  and  mostly  unstructured  back¬ 
grounds.  Some  of  the  imagery  had  mountains  in  the 
distant  background,  but  in  most  of  the  imagery  the 
horizon  was  close  to  the  camera.  Almost  all  of  the 
images  include  significant  portions  of  sky. 

The  backgrounds  available  at  Martin  Marietta  are 
much  more  complicated  than  at  Ft.  Carson.  Most 
of  the  imagery  is  of  distant  rolling  brown  grass¬ 
lands,  but  there  are  a  number  of  clumps  of  trees 
and  even  some  large  green  bushes.  There  are  im¬ 
ages  with  very  structured,  large  rock  outcrops. 
There  are  some  dirt  roads  and  some  buildings.  A 
number  of  the  narrow  field  of  view  images  do  not 


contain  any  portion  of  the  sky. 

The limited conditions of imagery collection preclude performance evaluations that are consistent with many mission requirements.

Scoring  rubrics 

Detections  were  scored  on  the  basis  of  consistency 
with  the  available  “ground  truth.”  Ground  truth 
on  the  images  was  produced  by  manually  drawing 
a  bounding  rectangle  around  each  true  target  on 
each  image.  No  independent  ground  truth,  based 
on  actual  conditions  on  the  ground  was  available. 

Targets  were  counted  as  detected  if  the  center  of 
the  target  report  fell  within  the  ground  truth 
bounding  rectangle.  True  targets  that  are  in  the 
image,  but  are  not  in  the  evaluation  set  (e.g.,  the 
pick-up  truck)  were  not  counted  as  either  detec¬ 
tions  or  misses  or  false  alarms.  Targets  that  were 
within  10  pixels  of  the  edge  of  the  frame,  or  par¬ 
tially  off  the  frame  were  not  counted.  Multiple 
detections  on  the  same  target  are  reported  as  a 
single  true  detection.  The  number  of  multiple  de¬ 
tections,  however,  is  reported. 

False Alarm Rates (FAR) were determined by counting all of the reported detections that are not true detections. This FAR was calculated on a frame by frame basis. True targets that are detected were not counted in the FAR. Multiple detections within a single “true target” bounding rectangle were not counted in the FAR.

Table 1 Color evaluation imagery characteristics

Characteristic                 Ft. Carson                       Martin Marietta September
Description of imagery         Scanned color film               CCD camera, auto “white balanced”
Number of replicates           Single replicate                 10 replicates/scene
Number of unique images        37                               110
Number of targets              56                               199
POT                            4000 to 500                      4000 to 30
Range-to-target                50-180 meters                    500 to 1600 meters
Target visibility              100% high visibility             74% high, 19% medium, and 7% low visibility
Obstruction                    54% none, 43% minor, 5% major    46% none, 38% minor, 17% major
Visible signal-to-noise        100% high                        100% high
Environmental clutter          100% low clutter                 20% low, 48% medium, and 32% high clutter
Sequestered/Not-sequestered    not sequestered                  sequestered
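
The scoring rubric described above can be sketched as a per-frame procedure. This is one possible reading of the rules, not the evaluation team's actual software: rectangles are assumed to be (left, top, right, bottom) in pixel coordinates, reports are assumed to be (x, y) centers, and the function and field names are hypothetical.

```python
def point_in_rect(x, y, rect):
    """Return True if (x, y) lies inside rect = (left, top, right, bottom)."""
    left, top, right, bottom = rect
    return left <= x <= right and top <= y <= bottom


def rect_near_edge(rect, width, height, margin):
    """Return True if rect comes within `margin` pixels of the frame border
    or extends off the frame."""
    left, top, right, bottom = rect
    return (left < margin or top < margin
            or right > width - 1 - margin or bottom > height - 1 - margin)


def score_frame(truth, not_scored, reports, width, height, margin=10):
    """Score one frame.

    truth:      ground-truth rectangles of targets in the evaluation set
    not_scored: rectangles of real objects outside the evaluation set
                (for example, the pick-up truck)
    reports:    (x, y) centers of the algorithm's target reports
    """
    scored = [r for r in truth if not rect_near_edge(r, width, height, margin)]
    skipped = list(not_scored) + [
        r for r in truth if rect_near_edge(r, width, height, margin)
    ]
    hits = [0] * len(scored)
    false_alarms = 0
    edge_reports = 0
    for x, y in reports:
        # Reports too close to the frame border are dropped from the analysis.
        if x < margin or y < margin or x > width - 1 - margin or y > height - 1 - margin:
            edge_reports += 1
            continue
        for index, rect in enumerate(scored):
            if point_in_rect(x, y, rect):
                hits[index] += 1
                break
        else:
            # Reports on non-scored objects are neither detections nor false alarms.
            if not any(point_in_rect(x, y, r) for r in skipped):
                false_alarms += 1
    detected = sum(1 for h in hits if h > 0)
    return {
        "detected": detected,
        "missed": len(scored) - detected,
        "multiple_detections": sum(h - 1 for h in hits if h > 1),
        "false_alarms": false_alarms,
        "edge_reports": edge_reports,
    }


# Hypothetical frame: two evaluation targets, one non-scored vehicle, five reports.
example = score_frame(
    truth=[(100, 100, 160, 140), (300, 200, 380, 250)],
    not_scored=[(500, 300, 560, 340)],
    reports=[(120, 120), (130, 125), (520, 310), (600, 50), (5, 200)],
    width=760,
    height=490,
)
print(example)
```

Probability of detection is then the total number of detected targets divided by the total number of scored targets, and the false alarm rate is the mean of the per-frame false alarm counts.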

A  second  analysis  was  performed  treating  the 
pick-up  truck  as  a  true  target,  even  though  it  was 
not  in  the  original  list  of  targets.  This  was  done 
because  it  was  a  camouflaged  military  pick-up 
truck,  and  it  seemed  reasonable  to  look  at  the  re¬ 
sults  in  this  framework  as  well. 

To  develop  performance  requirements,  we  worked 
with  the  University  of  Texas  at  Arlington  using 
their  Scenario  Based  Engineering  Process,  and 
with the algorithm developers. We were unable to determine any specific requirements for the probability of detection (Pd) or the false alarm rate (FAR), but we were able to reach a consensus, based on reasonableness, on Pd >= 0.9 and fewer than 1 false alarm per 60 frames as performance goals.

Algorithm  training 

The  instructions  that  the  UMASS  staff  received 
on  training  their  algorithm  were  consistent  with 
the  expected  operational  environment.  The  algo¬ 
rithm  was  to  be  trained  in  a  manner  that  would 
allow  separate  parameter  setups  to  be  used  for 
each collection. UMASS staff were allowed to select the specific samples to train on, including determining the content, range of parameters represented, and ultimately the number of images trained. This approach was used in order to allow the developers the optimum flexibility in selecting an operating point that could trade off probability of detection (Pd) and FAR.

UMASS  staff  successfully  trained  their  algorithm 
on  the  Ft.  Carson  imagery,  but  were  unable  to 
train  their  algorithm  on  the  Martin  Marietta  col¬ 
lection  imagery.  They  identified  two  problems 
with  the  Martin  Marietta  collection  imagery  that 
prevented them from successfully training the algorithm. In the first image of the ten-image replicate set, there was a shift in the mean response in the blue band of 18 counts (cf. Table 2), which translates to an almost 180° hue shift. The second
problem  that  UMASS  staff  identified  is  that  be¬ 
tween  images,  the  hue  of  the  background  under¬ 
goes  large  shifts.  Since  the  algorithm  uses  the 
difference  in  hue  between  the  targets  and  the 


background,  these  hue  shifts  made  the  algorithm 
un-trainable  for  the  Martin  Marietta  collections. 


Table 2 Mean band response for two Martin Marietta collection images

Mean Responses    Red Response    Green Response    Blue Response
Frame 1           167.9           167.2             142.1
Frame 2           166.8           165.0             170.5
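
The size of the hue shift implied by Table 2 can be checked with a short calculation. The sketch below converts the two mean band responses to hue using the standard HSV definition in Python's colorsys module. It assumes 8-bit full scale (hue does not depend on this choice) and uses the hue of the mean color as a stand-in for the scene hue, which is an approximation; the hue definition used by the algorithm itself is not given in the source.

```python
import colorsys


def hue_degrees(red, green, blue, full_scale=255.0):
    """Hue angle in degrees for one RGB triple, using the HSV definition."""
    hue, _saturation, _value = colorsys.rgb_to_hsv(
        red / full_scale, green / full_scale, blue / full_scale
    )
    return hue * 360.0


def hue_distance(hue_a, hue_b):
    """Smallest angular separation between two hues, in degrees."""
    difference = abs(hue_a - hue_b) % 360.0
    return min(difference, 360.0 - difference)


FRAME_1 = (167.9, 167.2, 142.1)  # mean red, green, blue from Table 2
FRAME_2 = (166.8, 165.0, 170.5)

hue_1 = hue_degrees(*FRAME_1)
hue_2 = hue_degrees(*FRAME_2)
print(f"Frame 1 hue: {hue_1:.1f} degrees")
print(f"Frame 2 hue: {hue_2:.1f} degrees")
print(f"Hue separation: {hue_distance(hue_1, hue_2):.1f} degrees")
```

Because the three band means are close to one another, both frames are nearly gray, and a modest change in the blue band moves the hue from the yellow side of the color circle to the blue side. This is the sensitivity that makes a hue-difference detector hard to train on such imagery.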

The  problems  experienced  by  the  UMASS  staff  in 
training  on  the  Martin  Marietta  color  images  are 
believed  to  be  the  result  of  the  following  two  cir¬ 
cumstances: 

1. The CCD camera used in the Martin Marietta collection was operated with the automatic white balance control functioning. This resulted in a loss of absolute color across images, since the color is adjusted to the overall color of each scene. Since the UMASS/Colorado State University algorithm is based on the absolute colors of both targets and background, the image-to-image variation in the CCD images caused serious problems with algorithm performance.

2.  The  background  vegetation  in  the  Martin 
Marietta  collection  was  gray/brown,  resulting 
in  an  indeterminate  hue.  Since  the  algorithm 
makes  distinctions  based  on  color  hue,  the 
indeterminacy  of  the  hue  caused  serious 
problems  with  algorithm  performance. 

We  were  unable  to  correct  for  these  effects. 
Therefore,  although  we  had  planned  to  use  both 
the  Martin  Marietta  and  the  Ft.  Carson  imagery 
sets,  the  evaluation  of  the  UMASS/Colorado  State 
University  algorithm  was  actually  performed  us¬ 
ing  the  Ft.  Carson  imagery  only. 

Evaluation  imagery  selection  and 
specification 

The UMASS/Colorado State University algorithm was tested on 37 images from the Ft. Carson collection. The major characteristics with respect to POT of the 56 targets contained in the 37 images are shown in Table 3. There are only three target types, and one of them, the M113-901, is a minor modification of another target type. There are essentially no images with the M113 oriented head-on to the sensor. All images contain at least one target. Most importantly, there are no targets with fewer than 250 pixels on target. The limited amount of usable evaluation imagery available seriously compromised the utility of this evaluation, particularly in terms of end-user mission objectives.

Table 3 Summary target information for color test imagery

Target            Pose     POT (>2000)    POT (1000-2000)    POT (250-1000)    Grand Total
M60               Front    0              4                  0                 4
                  Diag     0              3                  2                 5
                  Broad    8              1                  0                 9
M60 Total                  8              8                  2                 18
M113              Diag     4              3                  1                 8
                  Broad    0              2                  9                 11
M113 Total                 4              5                  10                19
M113-901          Front    0              0                  1                 1
                  Diag     0              3                  1                 4
                  Broad    2              9                  3                 14
M113-901 Total             2              12                 5                 19
Grand Total                14             25                 17                56
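
How far the test sample departs from a balanced design can be read directly from the marginal totals of Table 3. The sketch below tallies the table by pose, target type, and POT bin; the cell counts are transcribed from Table 3, while the dictionary layout and variable names are illustrative only.

```python
# Cell counts from Table 3: (POT > 2000, POT 1000-2000, POT 250-1000).
TABLE_3 = {
    ("M60", "Front"): (0, 4, 0),
    ("M60", "Diag"): (0, 3, 2),
    ("M60", "Broad"): (8, 1, 0),
    ("M113", "Diag"): (4, 3, 1),
    ("M113", "Broad"): (0, 2, 9),
    ("M113-901", "Front"): (0, 0, 1),
    ("M113-901", "Diag"): (0, 3, 1),
    ("M113-901", "Broad"): (2, 9, 3),
}

pose_totals = {}
type_totals = {}
pot_bin_totals = [0, 0, 0]
for (target_type, pose), counts in TABLE_3.items():
    pose_totals[pose] = pose_totals.get(pose, 0) + sum(counts)
    type_totals[target_type] = type_totals.get(target_type, 0) + sum(counts)
    for index, count in enumerate(counts):
        pot_bin_totals[index] += count

total_targets = sum(pose_totals.values())
uniform_share = total_targets / len(pose_totals)

print("targets by pose:", pose_totals)
print("targets by type:", type_totals)
print("targets by POT bin:", pot_bin_totals)
print(f"uniform expectation per pose: {uniform_share:.1f}")
```

The type totals are nearly balanced, but the pose totals are not: broadside views dominate and front views are rare, so results on this sample say little about head-on detection performance.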

Results 

UMASS staff ran their algorithm on 37 images and reported the bounding rectangles for each detection. Targets with bounding rectangles that were within 10 pixels of the edge of the frame, or partially off the frame, were not counted. On two occasions multiple detections on the same target were counted as a single true detection.

Performance results are displayed in Table 4. The Pd (87.4% +/- 10%, 95% confidence interval) is not different from the end-user performance goal of 90% Pd at statistically significant levels. However, this is partially a result of the relatively small sample size and the resulting large confidence interval. It is possible that the algorithm parameters could be tuned to increase the detection rate, but at the expense of increasing the false alarm rate.

Table 4 Color algorithm evaluation summary results

                         Probability of Detection, per target    False Alarm Rate, per frame
Performance Goals        90%                                     0.0166
Algorithm performance    87.4% (+/- 10%)                         6.7 (+/- 1.8)

There was a substantial false alarm rate. On average there were 6.7 false alarms per image, with 95% of the images having between 0 and 18 false alarms per image.

Almost all of the images (33 out of 37) had at least one false alarm and four images had 15 or more false alarms. The average false alarm rate is 400 times greater than the end-user performance goal of one false alarm per 60 frames.

The impact of the false alarm rate depends on the details of the operational scenario. If only the color detection algorithm were to be used, it is probable that detected targets would be chipped out and transmitted to a base station for display and evaluation by a human observer. In this scenario, the algorithm would function as an image pre-screener and missed targets would have a substantial impact on utility. The false alarm rate would impact the operational performance in both time and bandwidth requirements to send back each detected target and enough surround for context. Given the limited bandwidth of the link to the ground-station, the impact of the large number of false alarms is substantial in the amount of time that the scout must remain exposed.
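
The comparison of the measured performance with the goals can be reproduced approximately from the figures reported in Table 4. The sketch below uses a normal-approximation (Wald) binomial interval for the probability of detection. This is an assumption, since the source does not state how its confidence interval was computed, so the half-width obtained here is only expected to be of the same order as the reported +/- 10%.

```python
import math


def wald_half_width(proportion, sample_size, z_value=1.96):
    """Half-width of a normal-approximation 95% interval for a proportion."""
    return z_value * math.sqrt(proportion * (1.0 - proportion) / sample_size)


PD_OBSERVED = 0.874      # reported probability of detection
N_TARGETS = 56           # targets in the Ft. Carson test images
PD_GOAL = 0.90           # consensus performance goal
FAR_OBSERVED = 6.7       # reported false alarms per frame
FAR_GOAL = 1.0 / 60.0    # one false alarm per 60 frames

half_width = wald_half_width(PD_OBSERVED, N_TARGETS)
pd_low = PD_OBSERVED - half_width
pd_high = PD_OBSERVED + half_width
goal_inside_interval = pd_low <= PD_GOAL <= pd_high
far_ratio = FAR_OBSERVED / FAR_GOAL

print(f"Pd interval: {pd_low:.3f} to {pd_high:.3f}")
print(f"Pd goal inside interval: {goal_inside_interval}")
print(f"FAR goal: {FAR_GOAL:.4f} per frame")
print(f"Observed FAR is {far_ratio:.0f} times the goal")
```

With only 56 targets the interval spans several percentage points on either side of the estimate, which is why the 90% goal cannot be distinguished from the measured value, whereas the false alarm rate exceeds its goal by a factor of roughly 400 regardless of the interval method.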


Table 5 Summary of color false alarm types

False Alarm Type         Percent
Target like              6.8%
On horizon               18.9%
On bushes on horizon     12.8%
On near field bushes     11.5%
In sky                   28%
Other                    22%

Six  types  of  false  alarms  were  identified  in  the 
imagery  (see  Table  5).  A  camouflage-painted 
pick-up  truck  was  in  14  images  and  was  counted 
in  the  first  analysis  as  a  false  alarm  since  it  was 
not  in  the  original  target  set.  If  one  were  to  con¬ 
sider  the  camouflage-painted  truck  that  was  im¬ 
aged  as  a  target,  rather  than  a  false  alarm,  there  is 
no  statistically  significant  change  in  the  detection 
and false alarm rates. The probability of detection increases from 87% to 90% (+/- 10%) and the false alarm rate drops from 6.7 to 6.1 (+/- 1.8) per frame.

Almost 32% (18.9% + 12.8%) of the false alarms occurred along the line of the horizon, either where a tree or bush protruded above the horizon or just adjacent to the horizon line itself. Detections within bushes and other scrub that were in the near field of view, close to the camera, contributed 11.5% of the false alarms. Detections in the sky, often five or ten within a single image, contributed 28% of the false alarms. The rest of the false alarms came from a variety of undifferentiable causes.

Even though algorithm detections that were within 10 pixels of the edge of the image frame were removed from the analysis, there were a significant number of these detections (an average of 3.4 per frame, with a 95% interval of 0 to 11). The implications in an operational system are an increase in algorithm complexity and processing time to extract the reduced-size image. A more serious impact is the 4% reduction in effective image size, which would require additional image-to-image overlap when creating mosaics.

FLIR  Evaluation 
Objectives 

There  were  three  primary  objectives  for  the 
evaluation  of  the  Hughes  FLIR  detection  algo¬ 
rithm:  (1)  to  determine  the  range  of  user  parame¬ 
ters  where  the  detection  algorithm  met  or  failed  to 
meet  the  user  requirements,  (2)  to  determine  the 
performance  of  the  algorithm  over  the  selected 
data  set  in  terms  of  probability  of  detection  and 
false  alarm  rate,  and  (3)  to  evaluate  the  hypothesis 
that  an  adaptive  algorithmic  approach  is  superior 
to  a  non-adaptive  approach. 

Implementation  details 

The  evaluation  of  the  Hughes  FLIR  detection  al¬ 
gorithm  was  performed  at  NVESD  under  the  su¬ 
pervision  of  the  evaluation  team.  The  evaluation 
imagery  was  selected  as  described  previously  and 
the  specific  images  selected  were  not  known  to  the 
Hughes  development  team  until  the  completion  of 
the  evaluation.  The  initial  evaluation,  which  is  all 
that  is  reported  here,  includes  imagery  from  the  Ft. 
Carson  collection  only.  Images  from  the  Ft.  A.P. 
Hill  and  Martin  Marietta  September  collections 
are  currently  being  processed.  The  evaluation 
design  matrix  consists  of  84  targets  contained  in 
58  unique  images  collected  over  six  days.  All  of 
these  images  were  available  to  the  developer. 
There were no sequestered or replicate images used in this evaluation.

No additional development on the version of the algorithm used for evaluation was conducted by Hughes after September 1996. Hughes developed two sets of parameters specifically for the Ft. Carson collection. One parameter set is for a non-adaptive algorithm, while the second set is for an adaptive algorithm. Information was passed to the algorithm that identified the collection set, but no additional information (e.g., illumination conditions, time of day, target type, or range-to-target) was used by the algorithm. It is anticipated that the continuing testing will utilize four additional parameter sets, two for each of the two other collections.

Conditions  of  evaluation 
Target  set 

As discussed in the color evaluation section, it is desirable to have a uniform distribution of samples across the sampling strata. However, the distribution of targets and orientations for this evaluation is not uniform - almost one-third as many targets are oriented front/rear as are diagonal or broadside. For this target set, at a specific range, the number of pixels on target is greater for broadside and diagonal orientations than for the front/rear orientation. The distribution over the three target types is roughly uniform; however, there are fewer M113-901 oriented broadside, and more oriented diagonally, than one would expect.

The original evaluation memorandum of understanding only considered three vehicles as “true” targets: the M60, M113, and M113-901. In the Ft. Carson collection there are a number of images of a camouflage-painted pickup truck. Two separate analyses were performed: the first used only the targets initially agreed to as targets, and the second included the pickup truck in the target set. The pickup truck provides an additional 24 targets, mostly in a diagonal orientation (2 front/rear, 16 diagonal, and 6 broadside).

Image  set 

A summary of the image data set characteristics chosen for this evaluation is shown in Table 6. The pixels on target for the FLIR Ft. Carson collection ranged from 150 to 1900, significantly less than the POT used in the UMASS evaluation (500 to 4000). The RTT is the same as that used in the color evaluation and as such is significantly shorter than the operational requirements. This implies that the atmospheric effects are not consistent with what would be expected operationally, even in those cases where the POTs are appropriate.

Table  6  shows  both  the  image-science  oriented 
and  the  user-oriented  (i.e.,  subjective  scale  data) 
descriptive  data  on  the  images.  The  FLIR  imagery 
was  rated  as  generally  medium  visibility  (44%), 
that  is  between  25%  and  75%  of  the  boundary  of 
the  targets  were  visible.  The  imagery  was  judged 
to  have  high  Visible-Signal-to-Noise  (VSNR) 
(84%)  but  there  are  a  number  of  images  with 
VSNR  in  the  medium  and  low  range.  About  half 
(45%)  of  the  targets  are  obstructed. 

Evaluation  parameters  and  ranges 

Target  scene  and  sensor 

The  targets  that  were  used  in  this  evaluation  were 
painted  with  standard  army  camouflage  paint.  All 
of  the  targets  were  stationary  during  imaging.  The 
illumination  conditions  during  the  collections 
were  mostly  daytime  with  only  three  images  ac¬ 
quired  at  night.  The  weather  was  generally  clear. 
The  background  was  generally  treeless  with  green 
grassy  fields  and  slow  rolling  hills.  There  were 
very  few  bushes  and  mostly  unstructured  back¬ 
grounds.  Some  of  the  imagery  had  mountains  in 
the  distant  background,  but  in  most  of  the  imagery 
the  horizon  was  close  to  the  camera.  Almost  all  of 
the  images  include  significant  portions  of  sky.  In 
most  cases  the  target  engines  were  running  and  the 
target  bodies  were  hot. 

As with the color evaluation, the limited conditions of imagery collection preclude performance evaluations that are consistent with many mission requirements.

Scoring  rubrics 

The  scoring  rubrics  used  for  the  FLIR  evaluation 
were  the  same  as  used  in  the  color  evaluation. 
Detections  were  scored  on  the  basis  of  consistency 
with  the  available  “ground  truth.”  Ground  truth 
on  the  images  was  produced  by  manually  drawing 
a  bounding  rectangle  around  each  true  target  on 
each image. No independent ground truth, based on actual conditions on the ground, was available.

Table 6 FLIR evaluation imagery characteristics

Characteristic                    Ft. Carson FLIR Imagery
Description of imagery            Amber FLIR
Number of replicates              Single replicate
Number of images                  58
Number of targets                 84 (109 including pickup truck)
POT                               1900 to 150
Range-to-target (RTT)             50-180 meters
Target visibility                 27% high, 44% medium, 29% low visibility
Visible signal-to-noise (VSNR)    84% high, 15% medium, 1% low signal-to-noise (S/N)
Obstructed                        55% none, 30% minor, 15% major obstruction
Environmental clutter             1% high, 65% medium, 34% low clutter
Sequestered/Non-sequestered       not sequestered

Targets were counted as detected if the center of
the target report fell within the ground truth
bounding rectangle. True targets that were in the
image but not in the evaluation set (e.g., the
pickup truck) were not counted as detections,
misses, or false alarms. Targets that were within
10 pixels of the edge of the frame, or partially
off the frame, were not counted. Multiple
detections on the same target were scored as a
single true detection, and the number of multiple
detections was also reported.

False  Alarm  Rates  (FAR)  were  determined  by 
counting  all  of  the  reported  detections  that  were 
not  true  detections.  The  FAR  was  calculated  on  a 
frame  by  frame  basis.  True  targets  that  are  de¬ 
tected  were  not  counted  in  the  FAR.  Multiple 
detections  within  a  single  “true  target”  bounding 
rectangle  were  not  counted  in  the  FAR. 
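
The scoring rules above can be read as a small per-frame procedure. The following is only a sketch of one possible reading, not the AUTOSPEC implementation: the `TruthBox` type, the `score_frame` function, and the interpretation of "within 10 pixels of the edge" as a test on the rectangle's sides are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TruthBox:
    """Hand-drawn ground-truth rectangle in pixel coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float
    in_evaluation_set: bool = True  # False for e.g. the pickup truck

    def contains(self, x, y):
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

    def near_edge(self, width, height, margin):
        # Within `margin` pixels of the frame edge, or partially off the frame.
        return (self.x0 < margin or self.y0 < margin
                or self.x1 > width - margin or self.y1 > height - margin)


def score_frame(reports, boxes, width, height, margin=10):
    """Return (detected, missed, false_alarms, multiples) for one frame.

    reports: list of (x, y) centers reported by the algorithm.
    """
    scored, ignored = [], []
    for box in boxes:
        if box.in_evaluation_set and not box.near_edge(width, height, margin):
            scored.append(box)
        else:
            ignored.append(box)

    hits = [0] * len(scored)
    false_alarms = 0
    for x, y in reports:
        index = next((i for i, b in enumerate(scored) if b.contains(x, y)), None)
        if index is not None:
            hits[index] += 1          # detection (possibly a repeat)
        elif not any(b.contains(x, y) for b in ignored):
            false_alarms += 1         # report that lies on no true target
        # reports on ignored targets are neither detections nor false alarms

    detected = sum(1 for n in hits if n > 0)
    missed = len(scored) - detected
    multiples = sum(n - 1 for n in hits if n > 1)
    return detected, missed, false_alarms, multiples


# Toy frame: two evaluation targets and one out-of-set vehicle.
frame_truth = [
    TruthBox(50, 50, 80, 80),
    TruthBox(150, 150, 180, 180),
    TruthBox(100, 100, 130, 130, in_evaluation_set=False),
]
frame_reports = [(60, 60), (70, 70), (110, 110), (200, 20)]
result = score_frame(frame_reports, frame_truth, width=256, height=256)
print(result)
```

In the toy frame, two reports fall on the first target (one detection plus one multiple), the second target is missed, the report on the out-of-set vehicle is ignored, and the remaining report is a false alarm.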


A  second  analysis  was  performed  treating  the 
pick-up  truck  as  a  true  target,  even  though  it  was 
not  in  the  original  list  of  targets.  This  was  done 
because  it  was  a  camouflaged  military  pick-up 
truck  and  it  seemed  reasonable  to  look  at  the  re¬ 
sults  in  this  framework  as  well. 

Algorithm  training 

The  instructions  that  the  Hughes  development 
team  were  given  on  training  their  algorithm  were 
consistent  with  the  expected  operational  environ¬ 
ment.  The  algorithm  was  to  be  trained  in  a  man¬ 
ner  that  would  allow  separate  parameter  set-ups  to 
be  used  for  each  collection.  Hughes  was  allowed 
to  select  the  specific  samples  to  train  on,  including 
determining  the  content,  range  of  parameters  rep¬ 
resented,  and  ultimately  the  number  of  images 
trained. This approach was chosen to allow the
developers the optimum flexibility in selecting an
operating point that could trade off Pd and FAR.

Hughes  successfully  trained  their  algorithm  on  the 
Ft.  Carson  imagery  and  delivered  two  sets  of  pa¬ 
rameters,  one  non-adaptive  and  one  adaptive,  to 
NVESD  for  testing.  Hughes  did  not  choose  to 
develop separate parameters for day and night
imaging; however, they did choose to develop
separate parameter sets for each collection (only
the Ft. Carson set had been tested at the time of
publication).

Evaluation imagery selection and specification

The imagery that was used for the Hughes detection
evaluation was the entire set of Ft. Carson
imagery that was selected, as described in Table 6.
No  images  were  set  aside,  and  the  entire  evalua¬ 
tion  was  performed  in  one  algorithm  execution 
session. 

The  Hughes  FLIR  detection  algorithm  was  tested 
on  58  unique  images,  containing  84  targets.  The 
majority  (84%)  of  the  targets  have  POT  between 
250 and 1000. This is significantly fewer than the
POT available in the color evaluation (cf. Table 1).
As  with  the  color  evaluation,  the  limited  amount 
of  useable  evaluation  imagery  available  seriously 
compromised  the  utility  of  this  evaluation,  par¬ 
ticularly  in  terms  of  end-user  mission  objectives. 

The  algorithm  scoring  was  provided  by  the 
NVESD  AUTOSPEC  program  (Walters,  1990) 
which  automatically  scores  detections,  misses,  and 
false  alarms  using  the  rules  detailed  in  the  scoring 
rubrics. 


Results 

NVESD  ran  the  Hughes  algorithm  twice  on  all  of 
the  58  Ft.  Carson  images  and  scored  the  results. 
The  Hughes  algorithm  only  reported  a  single  point 
(centroid)  for  each  detection.  There  were  no  tar¬ 
gets  reported  within  10  pixels  of  the  edge  of  the 
frame  and  there  were  no  multiple  detections  on  the 
same  target. 

The performance results are displayed in Table 7.
The Pd for both the non-adaptive case (59% +/- 10%,
95% confidence interval) and the adaptive case
(49% +/- 10%, 95% confidence interval) differs from
the end-user performance goal of 90% Pd at
statistically significant levels. The adaptive Pd is
not different from the non-adaptive Pd at
statistically significant levels. However, this is
partially a result of the relatively small sample
size and the resulting large confidence interval.
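
The quoted +/- 10% intervals are consistent with a simple binomial calculation on 84 targets. The sketch below assumes the normal (Wald) approximation; the text does not say which interval method was actually used, so this is a plausibility check rather than a reconstruction of the original analysis.

```python
import math


def wald_half_width(p_hat, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


n_targets = 84          # targets in the FLIR evaluation set
goal = 0.90             # end-user Pd goal

half_widths = {}
for label, pd in (("non-adaptive", 0.59), ("adaptive", 0.49)):
    half = wald_half_width(pd, n_targets)
    half_widths[label] = half
    goal_inside = (pd - half) <= goal <= (pd + half)
    print(f"{label}: Pd = {pd:.2f} +/- {half:.3f}, goal inside interval: {goal_inside}")
```

Both half-widths come out close to 0.10, and the 90% goal lies outside both intervals, which matches the statements in the text. Because the two intervals overlap substantially, the data give no clear evidence that the two Pd values differ.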

There  was  a  substantial  false  alarm  rate.  On  aver¬ 
age  there  were  4.7  false  alarms  per  image  with  the 
non-adaptive  algorithm.  The  false  alarm  rate  de¬ 
creased,  at  statistically  significant  levels,  to  1.4 
false  alarms  per  image,  when  using  the  adaptive 
algorithm.  The  false  alarm  rate  for  the  adaptive 
algorithm  is  84  times  the  end-user  performance 
goal  of  one  false  alarm  per  60  frames. 

The use of the adaptive FLIR algorithm improves
the FAR, at the expense of the Pd. With a fixed
decision process, when the decision threshold is
changed, both the Pd and FAR change along a
fixed Receiver Operator Characteristic (ROC)
curve. This ROC fully determines the algorithm's
performance with respect to Pd and FAR.

Table 7  FLIR algorithm evaluation summary results

                                      Probability of Detection   False Alarm Rate
                                      (Pd), per target           (FAR), per frame
Performance goals                     90%                        0.0166
Algorithm performance, non-adaptive   59% (+/- 10%)              4.7 (+/- 0.2)
Algorithm performance, adaptive       49% (+/- 10%)              1.4 (+/- 0.4)

Two algorithms differ in performance if and only
if their ROCs differ. It is therefore important to
determine whether the measurements of Pd and FAR
for these two algorithms lie on two different ROC
curves. Unfortunately, we do not have enough data
from this evaluation to answer this question. The
additional information required to calculate the
ROC curves is the confidence in each detection,
that is, the intermediate data before a decision
threshold is applied. With the confidence
information, and additional data points to increase
the measurement precision, one would be able to
determine whether the two algorithms are the same
decision process with different thresholds or two
differently performing algorithms. The number of
data points necessary to measure a statistically
significant difference of 10% in Pd (e.g., 90% and
80%) at the same FAR would be approximately 350
(as compared to the 86 used).
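
If per-detection confidences were available, an empirical ROC could be traced by sweeping the decision threshold, as sketched below. This is an illustrative sketch only: the function name, the input layout, and the toy numbers are hypothetical, and it assumes each target gives rise to at most one candidate detection.

```python
def roc_points(confidences, is_target, n_targets, n_frames):
    """Sweep a decision threshold over pre-threshold candidate detections.

    confidences: confidence value of each candidate detection
    is_target:   True where the candidate falls on a ground-truth target
    Returns a list of (threshold, Pd, false alarms per frame).
    """
    points = []
    for threshold in sorted(set(confidences), reverse=True):
        kept = [hit for c, hit in zip(confidences, is_target) if c >= threshold]
        true_detections = sum(kept)
        false_alarms = len(kept) - true_detections
        points.append((threshold,
                       true_detections / n_targets,
                       false_alarms / n_frames))
    return points


# Toy data: four candidates over two frames containing two targets.
toy_roc = roc_points(
    confidences=[0.9, 0.8, 0.7, 0.6],
    is_target=[True, False, True, False],
    n_targets=2,
    n_frames=2,
)
for threshold, pd, far in toy_roc:
    print(f"threshold {threshold:.1f}: Pd = {pd:.2f}, FAR = {far:.2f} per frame")
```

Two algorithms whose (Pd, FAR) points fall on the same curve produced this way would be the same decision process operated at different thresholds; points on different curves would indicate different underlying performance.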

Comparison between color and FLIR evaluation results


Both the color and FLIR evaluations were performed
on imagery that was taken at the same collection.
Even though there are obvious differences in the
modalities of these two evaluations, it is useful
to look at both their differences and their
similarities. Table 8 details the image
characteristics for each evaluation. The fields of
view of the two sensors are very different, with
the color imagery about 1.6 times as wide as the
FLIR (40° vs. 25°). The resolution (IFOV) of the
color sensor is 1.6 times as fine as that of the
FLIR sensor (1.0 mr vs. 1.6 mr). The average POT of
the color imagery is more than three times the
average POT of the FLIR imagery (1610 vs. 485). The
implication is that the color algorithm had
significantly more pixels on target with which to
make a detection than the FLIR algorithms.

Table 9 compares the performance of the algorithms
using a number of metrics. The color detection
meets the end-user requirement of a 90% Pd, while
neither the non-adaptive nor the adaptive FLIR
algorithm does. The inclusion of the pickup truck
in the target set does not change any of the
performance results. The Pd is about 90% for the
color algorithm, about 60% for the FLIR
non-adaptive algorithm, and about 50% for the FLIR
adaptive algorithm. There are about 6.5 false
alarms per frame with the color algorithm, and 4.5
and 1.3 false alarms per frame with the FLIR
non-adaptive and adaptive algorithms, respectively.
None of the algorithms meets the end-user
requirement for a FAR of 1 per 60 images.

Table 8  Comparison of color and FLIR image characteristics

                              Color evaluation                FLIR evaluation
Description of imagery        Scanned color film              3.5 micron AMBER FLIR
Field of View (FOV)           40.4° (H), 28.4° (V)            24.9° (H), 23° (V)
Image size                    768 x 512 pixels                256 x 256 pixels
IFOV                          0.9 x 1.0 mr                    1.7 x 1.6 mr
Number of replicates          Single replicate                Single replicate
Number of unique images       37                              58
Number of targets             56                              86
POT                           500 - 4000 (mean 1610)          150 - 1900 (mean 485)
Range-to-target               50-180 meters (mean 108)        50 to 180 meters (mean 114)
Target visibility             100% high visibility            27% high, 44% medium, 29% low visibility
Obstruction                   54% none, 43% minor, 5% major   55% none, 30% minor, 15% major obstruction
Visible signal-to-noise       100% high                       84% high, 15% medium, 1% low S/N
Environmental clutter         100% low clutter                1% high, 65% medium, 34% low clutter
Sequestered/Not-sequestered   not sequestered                 not sequestered

Table 9  Performance comparisons of color and FLIR evaluations

                              Color detection               FLIR non-adaptive             FLIR adaptive
Pickup truck scored as        False alarm    Target         False alarm    Target         False alarm    Target
Pd                            87% (+/-10%)   90% (+/-10%)   59% (+/-10%)   63% (+/-10%)   49% (+/-10%)   54% (+/-10%)
FA/Frame                      6.7 (+/-1.8)   6.1 (+/-1.8)   4.7 (+/-0.2)   4.4 (+/-0.2)   1.4 (+/-0.4)   1.1 (+/-0.4)
FA/Degree                     0.006          0.005          0.008          0.008          0.002          0.002
Targets/Reported detections   18%            24%            16%            21%            34%            48%
Frames without any FA         11%            11%            0%             0%             29%            55%
Frames >5 FA                  62%            62%            66%            52%            7%             7%

Since the color and FLIR imagery have different
sizes and IFOVs, it is useful to compare their
false alarm rates on a comparable basis. When the
data are normalized by square degrees of field of
view, one comes to a different conclusion on the
relative false alarm rates of the algorithms. The
non-adaptive FLIR algorithm has the highest FAR
(0.008 per square degree), followed by the color
algorithm (0.005), and then by the FLIR adaptive
algorithm (0.002). These are the metrics that an
end-user is concerned with, since they relate to
the expected number of false alarms that the user
would see when performing surveillance on a piece
of terrain. For example, viewing an area 45 degrees
wide and 20 degrees high, one would expect about 7
false alarms using the FLIR non-adaptive algorithm
(45 x 20 x 0.008 = 7.2).
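
The normalization can be checked directly from the per-frame false alarm rates and the fields of view listed in Table 8. The sketch below is a back-of-the-envelope check, assuming the frame area in square degrees is simply the product of the horizontal and vertical fields of view; the function name is illustrative.

```python
def false_alarms_per_square_degree(fa_per_frame, fov_h_deg, fov_v_deg):
    """Normalize a per-frame false alarm rate by the angular area of the frame."""
    return fa_per_frame / (fov_h_deg * fov_v_deg)


# Per-frame rates (pickup truck scored as a false alarm) and fields of view.
color = false_alarms_per_square_degree(6.7, 40.4, 28.4)
flir_non_adaptive = false_alarms_per_square_degree(4.7, 24.9, 23.0)
flir_adaptive = false_alarms_per_square_degree(1.4, 24.9, 23.0)

# Expected false alarms when surveying a 45 x 20 degree area.
expected_in_sector = 45 * 20 * round(flir_non_adaptive, 3)

print(round(color, 3), round(flir_non_adaptive, 3), round(flir_adaptive, 3))
print(round(expected_in_sector, 1))
```

The rounded values reproduce the FA/Degree row of Table 9 for the case in which the pickup truck is scored as a false alarm.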


Another metric of importance to the end-user is
how many detections they must review until they
find a target. This is measured by the ratio of
detected targets to reported detections, expressed
as a percentage. Only about 20% of the detections
that the color and the non-adaptive FLIR
algorithms report are “true” targets. The adaptive
FLIR algorithm does somewhat better, increasing to
48% when the pickup truck is considered as a
target. However, even at best, only one half of the
reported detections are targets, and one half of
the targets are missed.

The distribution of the number of false alarms per
frame is not the same between the color and FLIR
algorithms. The FLIR non-adaptive algorithm
produces false alarms in every image, but the
adaptive algorithm (with the pickup truck counted
as a target) produced no false alarms in 55% of the
images. With the color algorithm, only 11% of the
images are free of false alarms (Table 9). More
important is the distribution of the number of
false alarms per frame. About 60% of the color and
non-adaptive FLIR processed images have more than
5 false alarms per frame, as compared to 7% of the
adaptive FLIR processed images. In the first case,
the false alarms are bursty on an image-by-image
basis: if there is one false alarm, there tend to
be many false alarms.

Discussion 

The development of the evaluation framework, the
evaluation design, the implementation of the
evaluation, and the evaluation results raise a
number of important issues: (1) none of the
algorithms met the end-user performance
requirements developed by the SEP process; (2) the
imagery data available for analysis was deficient
and did not permit the evaluation of algorithm
performance over the range of end-user specified
performance conditions; and (3) the algorithm
evaluation process did not collect all of the data
necessary to provide a complete analysis.

All  of  the  algorithms  have  been  evaluated  over  a 
common  space  that  is  small  with  respect  to  user 
requirements.  Five  performance  objectives  were 
identified:  probability  of  detection,  false  alarm 
rate,  algorithm  execution  speed,  location  error, 
and degree of identification. However, the
algorithms were evaluated using only three of these
performance measures: Pd, FAR, and location error.
Even over this limited set of evaluation conditions
and performance measures, only the color
algorithm  meets  the  user  goals  and  only  with  re¬ 
spect  to  Pd.  No  algorithm  has  false  alarm  rates 
that  approach  the  user  goals.  In  the  best  case,  the 
false  alarm  rate  is  66  times  the  goal.  Additionally, 
it  should  be  noted  that  the  color  algorithm  is  not 
effective  at  night.  This  level  of  performance  im¬ 
plies  that  significant  improvements  in  the  algo¬ 
rithms’  performance  will  be  required  before  they 
are  suitable  for  the  Army  scout  tasks. 

An algorithm's performance, particularly its Pd,
is critically affected by system design factors
outside of the algorithm designer's control. The
number of pixels on target (POT) is a major factor
in the Pd for a given algorithm. POT is determined
mainly by user requirements (range to target and
target type) and system design (focal length and
array pixel size). In this evaluation, the
combination of user requirements and system design
resulted in images with few POT (cf. Figure 1). The
average POT were low (Table 8), with means of 1600
and 500 for the color and FLIR imagery, and
minimum POTs of 500 and 150.
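
The dependence of POT on range and sensor design can be illustrated with a small-angle estimate: each pixel covers roughly (range x IFOV) on the target in each direction. The sketch below uses the IFOVs and mean ranges from Table 8; the projected target area of 17 square meters is a hypothetical value chosen only for illustration and is not reported in the evaluation.

```python
def pixels_on_target(target_area_m2, range_m, ifov_h_mrad, ifov_v_mrad):
    """Small-angle estimate of the number of pixels covering a target."""
    footprint_h_m = range_m * ifov_h_mrad * 1e-3   # pixel width at the target
    footprint_v_m = range_m * ifov_v_mrad * 1e-3   # pixel height at the target
    return target_area_m2 / (footprint_h_m * footprint_v_m)


ASSUMED_TARGET_AREA_M2 = 17.0   # hypothetical projected vehicle area

color_pot = pixels_on_target(ASSUMED_TARGET_AREA_M2, 108, 0.9, 1.0)
flir_pot = pixels_on_target(ASSUMED_TARGET_AREA_M2, 114, 1.7, 1.6)

print(round(color_pot), round(flir_pot))
```

Under this assumed target area the estimates fall near the mean POT values of Table 8 for both sensors, which illustrates how strongly POT is fixed by range and IFOV rather than by the detection algorithm.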


The ability to determine the performance of an
RSTA algorithm is critically dependent on the
availability of test imagery with the proper
characteristics. There are two fundamental
requirements for this test imagery. It must cover
the requirements space, and it must provide a
sufficient number of independent samples to support
statistical hypothesis testing. In practical terms,
the test imagery must be acquired under the
conditions that the operational algorithm will
experience. There must be images with the same
targets, at the same ranges, with the same
configurations and articulations, with the same
backgrounds, and the same atmospherics. Secondly,
there cannot be just one image at a test condition;
there must be multiple independent samples.

The  collections  of  imagery  available  for  this 
evaluation  have  always  had  serious  deficiencies  in 
terms  of  the  extent  to  which  there  is  coverage  of 
the parameter space. These deficiencies have
limited the ability of this evaluation to attain its
goal of determining the algorithms' ability to
perform the user-specified mission. The databases
necessary for this type of evaluation require
careful and concentrated attention from the program
inception.  Sufficient  funds  must  be  made  avail¬ 
able  to  either  collect  operational  imagery  or  to 
generate  synthetic  imagery.  The  evaluation  of 
pilot  collections  early  in  the  program  should  be  a 
guide  to  full-scale  collections. 

Finally, the comparison of the color and FLIR
evaluations raises a number of interesting questions
that we cannot answer with the available data.
Does  the  color  algorithm  perform  differently 
(better)  than  either  of  the  FLIR  algorithms?  Does 
the  FLIR  adaptive  algorithm  perform  differently 
than  the  FLIR  non-adaptive  algorithm?  These 
questions  require  the  generation  and  comparison 
of  ROC  curves  for  the  various  algorithms.  The 
RSTA  algorithms  were  not  designed  to  provide 
the  necessary  data  for  this  type  of  analysis.  Such 
data-collection  mechanisms  are  reasonably  easy  to 
implement  and  would  provide  valuable  data  for 
guiding future research.

Abstract

The  focus  of  this  MURI  project  is  on  trainable,  mod¬ 
ular,  vision  systems,  that  is,  systems  that  can  be 
easily  reconfigured  for  a  range  of  visual  tasks,  and 
easily  reprogrammed  through  training  on  example 
images.  The  main  subprojects  include:  a  humanoid 
robot that serves as a platform for biologically
motivated trainable visual modules, a computational
framework  for  designing  visual  search  engines  for  a 
range  of  tasks  in  image  databases  and  video,  and 
reconfigurable  hardware  specialized  for  trainable  vi¬ 
sion  tasks. 

1  Objectives 

Our  multi-institution  project  focuses  on:  1)  devel¬ 
oping  versatile  vision  and  sensing  systems  that  can 
be  easily  customized  for  different  tasks  in  different 
domains,  largely  motivated  by  biological  models  of 
visual  processing,  2)  modeling  and  analyzing  the  ca¬ 
pabilities  of  these  systems,  3)  estimating  the  system 
performance  for  tasks  in  robotics,  inspection,  medi¬ 
cal  diagnosis,  surveillance  and  target  recognition. 

To  reach  these  goals,  a  multidisciplinary  approach  - 
cutting  across  computer  science,  electrical  engineer¬ 
ing,  statistics,  neural  networks,  psychophysics  and 
visual  physiology  -  is  key  if  we  want  to  develop  not 
simply  another  algorithm  but  a  vision  architecture 
that  -  like  biological  systems  -  is  robust,  flexible  and 
easily  adaptable  to  many  different  tasks. 

Computer  vision  will  be  critical  for  advanced  weapon 
systems,  surveillance,  transportation,  medicine  and 
manufacturing  in  the  21st  century.  Unfortunately, 
existing computer vision systems are still too
inflexible for widespread use in military and
commercial applications. Each machine perception task
is different and typically requires very specific software
and  hardware.  A  small,  generic  vision-based  system 
that  could  be  easily  customized  without  significant 
reprogramming  will  meet  the  forthcoming  needs  for 
reliable  and  autonomous  performance. 

Thus  the  central  idea  in  our  approach  is  to  utilize  a 
multidisciplinary  team  to  develop  modular  and  train- 
able  architectures,  inspired  by  biological  systems  and 
motivated  by  sound  theories  in  signal  processing, 
statistics,  function  approximation  and  neural  net¬ 
works.  Modularity  ensures  that  components  may 
easily  be  upgraded  or  replaced  with  alternatives,  or 
connected  in  alternative  arrangements.  Trainability 
ensures  that  the  same  general  framework  can  easily 
be  reused  for  different  tasks,  and  can  be  programmed 
largely by exposure to examples.

Towards this end, we have built several prototypes
that are configured from a set of basic software and
hardware modules and trained to perform several
different visual tasks, such as image indexing,
inspection, reflexive visual behaviours for
autonomous navigation, eye-hand coordination, scene
classification, and detection of instances of object
classes in images. Our project thus makes a step towards
bringing  together  the  disciplines  of  machine  percep¬ 
tion  and  machine  learning  and  towards  developing  a 
system  that  can  display  some  of  the  robustness,  flex¬ 
ibility  and  adaptability  of  biological  visual  systems. 


1.1  Specific  components 


The  project’s  current  foci  include:  a  humanoid  robot 
composed  of  reconfigurable,  trainable,  biologically 
motivated  visual  modules;  a  computational  frame¬ 
work  for  creating  visual  search  engines;  and  support¬ 
ing  projects  in  specialized  hardware,  low-level  fea¬ 
ture  extraction  and  attention.  Below,  we  highlight 
example successes and directions for future work.

2 A Trainable Architecture for Object Classification and Detection

A  central  motivating  example  for  much  of  our  work 
has  been  that  of  designing  search  engines  that  can 
deal  with  large  volumes  of  images,  both  in  static 
databases  and  in  video  streams.  To  be  effective  at 
retrieving  images  in  such  cases,  a  search  engine  must 
be  able  to  deal  with  classes  of  objects  (e.g.,  find  im¬ 
ages  with  people  in  them,  or  with  particular  vehicles 
present),  and  with  contextual  information  (e.g.,  find 
images  of  particular  classes  of  scenes  such  as  moun¬ 
tains,  forests,  cityscapes).  This  means  we  need  a 
computational  framework  for  efficiently  representing 
the  wide  range  of  images  that  particular  classes  of 
objects  can  engender. 

We  are  developing  exactly  such  a  new  framework  for 
synthesizing  search  engines  for  visual  data  bases  of 
images  and  -  in  the  future  -  video.  Our  goal  is  a 
system  that  can  detect  a  wide  range  of  images  and 
objects  in  video  and  databases,  including  classes  of 
objects  such  as  faces,  people,  vehicles,  and  classes  of 
scenes  such  as  mountainscapes,  fields,  beaches  and 
so  on.  The  keys  to  such  a  framework  are  novel  repre¬ 
sentations  of  images  and  their  contents;  flexible  clas¬ 
sifiers  that  can  deal  with  large  variations  in  objects 
and  that  can  be  trained  efficiently  from  examples; 
and  verification  methods  that  can  verify  and  screen 
detections  raised  by  the  classifiers. 

2.1  Object  Detection 

In  the  past  few  years  image-based  techniques  have 
achieved  considerable  success  in  synthesizing  effec¬ 
tive  face  detectors  from  a  training  set  consisting  of  a 
large  number  of  example  face  images.  We  developed 
such  a  system  [28]  which  was  also  demonstrated  for 
the similar task of detecting eyes. Rowley et al. [23]
implemented a system with a similar architecture
that was significantly faster. More recently we have
replaced  the  classifier  used  by  [28]  with  a  Support 
Vector  Machine  classifier  [18]  achieving  almost  real 
time performance on a Pentium platform. Attempts
to apply the same architecture to the detection of
other types of objects, however, encountered
unexpected difficulties. It quickly became clear that
the key problem was the specific image representation
used. The Sung-Poggio system (and also the similar
CMU system) uses the simplest possible image
representation, just pixel values. A pixel-based
representation is sufficient in situations in which
there is a reproducible pattern of grey values, e.g.,
frontal faces. It fails in tasks like the detection of
people, where the pixel pattern can be highly variable
between different images of people. We have now found
another representation which is biologically plausible
and which seems to be much better for the difficult
task of detecting people in complex images. It may
well be a general representation for many object
detection tasks. The representation we propose for
object  detection  consists  of  an  object  specific  subset 
of  an  overcomplete  dictionary  of  Haar  wavelets. 

The  main  contribution  to  finding  a  better  represen¬ 
tation  than  pixels  is  due  to  Sinha  (1995)  [25]  who 
first  suggested  the  use  of  ratio  templates  as  a  way 
to  encode  qualitative  photometric  and  chromatic  re¬ 
lations  between  image  regions  that  are  practically 
invariant  under  large  changes  of  illumination  and 
viewpoint.  Lipson  [14]  has  extended  the  scheme  of 
Sinha  and  shown  that  it  can  deal  successfully  with 
the  problem  of  natural  scene  classification. 

2.1.1  Sinha’s  ratio  templates 

In  his  thesis  [25],  Sinha  suggested  representing  ob¬ 
jects  as  ratio  templates,  that  is,  sets  of  photometric 
inequalities  between  the  average  brightness  values 
of  pairs  of  appropriately  chosen  large  image  regions. 
The  relations  encode  only  the  directions  of  the  in¬ 
equalities  rather  than  their  magnitudes.  The  image 
regions  and  their  qualitative  inter-relationships  are 
encoded  as  a  directed  graph  and  the  model-to-image 
matching  process  is  handled  as  a  graph-matching 
task.  This  template  matching  approach  can  be  ef¬ 
ficiently  implemented  in  a  multi-resolution  frame¬ 
work,  since  the  average  brightnesses  of  different  re¬ 
gions  can  be  readily  obtained  by  accessing  single 
pixel  values  in  the  appropriate  pyramid  level.  Sinha 
also  devised  a  correlational  learning  scheme  to  ex¬ 
tract  qualitative  object  models  from  sets  of  normal¬ 
ized  example  images.  The  scheme  was  tested  on  the 
problem of detecting faces using a hand-crafted
qualitative model for a human face comprised of about
15 inequality relations between image regions
corresponding to the forehead, eyes, cheeks, nose, mouth
and chin, achieving detection. As the experimental
results  demonstrate,  these  relations  are  robust  to  the 
photometric  variations  introduced  in  face  images  by 
changing  illumination  conditions.  As  a  face  detec¬ 
tor  the  scheme  did  not,  however,  perform  as  well  as 
Sung’s  or  Osuna’s,  partly  because  of  the  limitations 
of  the  classification  stage  used  by  Sinha  -  a  simple 
template  matching. 
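
The idea of a ratio template can be illustrated with a few brightness inequalities between region averages. The sketch below is a deliberately simplified stand-in: it counts satisfied relations instead of performing the graph matching described above, and the region names, layout, and toy image are hypothetical rather than Sinha's actual face model.

```python
def region_mean(image, region):
    """Mean brightness of a rectangle (row0, row1, col0, col1), end-exclusive."""
    r0, r1, c0, c1 = region
    values = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(values) / len(values)


def ratio_template_score(image, regions, relations):
    """Fraction of 'a is brighter than b' relations that hold in the image."""
    means = {name: region_mean(image, box) for name, box in regions.items()}
    satisfied = sum(1 for a, b in relations if means[a] > means[b])
    return satisfied / len(relations)


# Hypothetical four-band layout on a 4 x 4 window.
regions = {
    "forehead": (0, 1, 0, 4),
    "eyes": (1, 2, 0, 4),
    "cheeks": (2, 3, 0, 4),
    "mouth": (3, 4, 0, 4),
}
# Only the direction of each inequality is encoded, not its magnitude.
relations = [("forehead", "eyes"), ("cheeks", "eyes"), ("cheeks", "mouth")]

window = [[200] * 4, [50] * 4, [180] * 4, [90] * 4]
dimmed = [[value // 2 for value in row] for row in window]

score_normal = ratio_template_score(window, regions, relations)
score_dimmed = ratio_template_score(dimmed, regions, relations)
print(score_normal, score_dimmed)
```

Because only the ordering of region brightnesses matters, halving the overall illumination leaves the score unchanged, which is the kind of photometric robustness the text describes.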

2.1.2  Flexible  templates  for  classification 

Using  this  idea  of  relative  relationships  between  im¬ 
age  regions  as  a  basis,  Lipson  has  developed  a  frame¬ 
work  for  scene  classification  and  object  detection 
based on configural templates. Such templates again
utilize relative relationships between image regions,
including  intensity,  color,  texture  and  other  cues  as 
bases.  In  a  model  template,  extended  regions  are  in¬ 
terrelated  by  flexible  spatial  relationships,  which  can 
be  envisioned  as  a  set  of  image  regions  connected  by 
springs.  Between  pairs  of  regions  one  defines  sets  of 
relative  relationships.  A  match  of  a  model  template 
to  an  image  is  measured  by  determining  the  extent 
to  which  such  a  flexible  template  must  be  stretched 
or  twisted  to  fit  the  image,  and  the  extent  to  which 
the  relative  relationships  are  found  to  hold  between 
the  corresponding  image  regions. 

In initial testing, Lipson was able to show that
hand-crafted templates are very successful at
distinguishing classes of scenes, such as “snowy
mountains” or “snowy mountains with lakes” or
“fields” or “waterfalls” or “cityscapes”, showing
true positive rates on the order of 80% with false
positive rates of only about 1%. This demonstrates
the ability of
the  approach  to  deal  with  wide  variations  in  color, 
intensity  and  spatial  layout  common  within  such 
large  classes  of  scenes.  Further,  Lakshmi  Ratan  has 
demonstrated  that  by  adding  spatial  information  (in 
this  case  a  Hausdorff  matcher)  within  individual  re¬ 
gions,  one  can  easily  extend  the  system  to  deal  with 
detection  problems,  in  this  case  detecting  instances 
of  images  that  contain  cars. 

Of  course,  hand  crafted  models  are  only  a  testbed. 
The  goal  is  to  create  trainable  systems,  and  we  have 
built  two  such  systems.  The  goal  of  each  system  is 
to  allow  a  user  to  build  his  own  template,  by  simply 
selecting  a  set  of  example  images,  and  letting  the 
system  deduce  a  template  that  captures  the  com¬ 
monality  within  the  set  of  images.  Although  in  many 
cases  there  is  no  unique  common  template  for  a  set 
of  images,  we  have  demonstrated  both  approaches 
as  successfully  extracting  templates  that  correctly 
retrieve  the  target  class.  Current  work  is  focused 
on  methods  for  including  negative  examples,  meth¬ 
ods  to  support  refinement  of  templates  by  allowing 
the user to select exemplars from retrieval sets, and
methods  of  merging  multiple  cues  into  a  common 
flexible  template  framework. 

2.1.3  Image  Representation:  overcomplete 
dictionaries  and  sparsification 

A  general  approach  to  object  detection  -  being  de¬ 
veloped  by  Oren,  Sinha,  Girosi  and  Poggio  -  is  to 
consider  representations  of  images  that  can  be  cus¬ 
tomized  for  specific  tasks.  To  this  end,  suppose  we 
consider  examples  of  the  object  class  to  be  detected 
-  like  faces  or  people  -  resized  to  a  standard  size 
dictated by the type of object. Within this window,
consider the dictionary of Haar wavelets.
Two-dimensional Haar wavelets effectively encode
bright-dark relations at various positions, scales and
orientations in the image. They do so in a way that can
be fully characterized mathematically. A complete set
of Haar wavelets captures the full information in an
image. In fact, we need an overcomplete dictionary
because  we  need  more  dense  shifts  of  the  wavelets 
than  strictly  required  by  completeness  (see  Oren  et 
al.  in  this  volume).  We  could  then  simply  analyze 
the  image  window  in  terms  of  this  set  of  basis  func¬ 
tions  and  provide  the  resulting  coefficients  as  inputs 
to  an  appropriate  classifier.  The  dictionary,  how¬ 
ever,  is  too  large  (say  thousands  of  basis  functions) 
to  be  a  practical  solution,  since  a  high  input  dimen¬ 
sionality  in  the  classifier  almost  always  implies  the 
need  for  a  very  large  set  of  training  examples. 
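
To make the notion of an overcomplete dictionary concrete, the sketch below computes unnormalized two-dimensional Haar responses (vertical, horizontal, diagonal) at several scales with shifts denser than the wavelet support. It is an illustration only: the quarter-support shift, the function names, and the toy window are assumptions, and the actual construction is the one described by Oren et al.

```python
def box_sum(img, r0, r1, c0, c1):
    """Sum of pixel values in rows r0..r1-1 and columns c0..c1-1."""
    return sum(img[r][c] for r in range(r0, r1) for c in range(c0, c1))


def haar_responses(img, r, c, s):
    """Unnormalized Haar responses for an s x s support with top-left (r, c)."""
    h = s // 2
    tl = box_sum(img, r, r + h, c, c + h)
    tr = box_sum(img, r, r + h, c + h, c + s)
    bl = box_sum(img, r + h, r + s, c, c + h)
    br = box_sum(img, r + h, r + s, c + h, c + s)
    return {
        "vertical": (tl + bl) - (tr + br),     # left half minus right half
        "horizontal": (tl + tr) - (bl + br),   # top half minus bottom half
        "diagonal": (tl + br) - (tr + bl),
    }


def haar_dictionary(img, scales=(2, 4), shift_fraction=4):
    """Coefficients at dense shifts (support / shift_fraction, at least 1 pixel)."""
    rows, cols = len(img), len(img[0])
    coeffs = {}
    for s in scales:
        step = max(1, s // shift_fraction)
        for r in range(0, rows - s + 1, step):
            for c in range(0, cols - s + 1, step):
                for kind, value in haar_responses(img, r, c, s).items():
                    coeffs[(kind, s, r, c)] = value
    return coeffs


# Toy 4 x 4 window: bright left half, dark right half.
toy_window = [[1, 1, 0, 0] for _ in range(4)]
toy_coeffs = haar_dictionary(toy_window)
print(len(toy_coeffs), toy_coeffs[("vertical", 4, 0, 0)])
```

Even for this 16-pixel window the dense shifts yield more coefficients than pixels, which is why a selection step is needed before the coefficients are passed to a classifier.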

For this reason, in our system we use a “learning”
stage, in which a much sparser representation
is learned, based on the task. As shown in the paper
by  Oren  et  al.  (this  volume)  a  few  example  images 
of  the  object  of  interest  are  used  to  find  which  subset 
of  the  basis  functions  is  needed  to  represent  it,  by 
keeping  only  those  with  non-zero  absolute  value  of 
the  coefficient,  averaged  over  the  training  set.  After 
this  dimensionality  reduction  stage,  only  a  few  spe¬ 
cific  basis  functions  are  needed  to  analyze  the  image 
and  to  provide  the  inputs  to  the  classifier. 
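
The selection step can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of Oren et al.: the names `select_basis` and `project`, the `threshold` parameter, and the toy coefficient values are all hypothetical. The text keeps coefficients whose averaged absolute value is non-zero; the sketch assumes a small positive cut-off stands in for "non-zero".

```python
def select_basis(coefficients, threshold):
    """Return indices of basis functions whose mean |coefficient| exceeds threshold.

    coefficients: one list of wavelet coefficients per training example,
                  all of the same length.
    threshold:    hypothetical cut-off; the text does not specify a value.
    """
    n_examples = len(coefficients)
    n_basis = len(coefficients[0])
    mean_abs = [
        sum(abs(example[j]) for example in coefficients) / n_examples
        for j in range(n_basis)
    ]
    return [j for j, m in enumerate(mean_abs) if m > threshold]


def project(coefficient_vector, selected):
    """Keep only the selected coefficients as input to the classifier."""
    return [coefficient_vector[j] for j in selected]


# Toy data (invented): 3 training examples, 5 basis functions.
training = [
    [0.9, 0.0, -0.8, 0.01, 0.0],
    [1.1, 0.0, -0.7, -0.02, 0.0],
    [1.0, 0.0, -0.9, 0.00, 0.0],
]
selected = select_basis(training, threshold=0.1)
features = project([0.95, 0.3, -0.85, 0.0, 0.1], selected)
```

Only the basis functions that respond consistently across the training examples survive, so the classifier sees a two-dimensional input here instead of a five-dimensional one.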

2.1.4  Scale  and  Position  Invariance 

As  in  the  original  architecture  proposed  by  Sung 
and  Poggio  (and  by  many  others  before),  detection 
is  performed  by  moving  the  basic  window  used  by 
the  classifier  across  the  image  in  space  and  in  scale. 
Thus,  scale  and  position  invariance  is  achieved  by 
brute  force,  simply  by  scanning  and  zooming  an  im¬ 
age.  As  described  in  Oren  et  al.  (this  volume)  the 
scanning  can  be  done  somewhat  more  elegantly  in 
the  space  of  the  wavelet  coefficients. 

2.1.5  The  learning  engine:  Support  Vector 
Machines 

We  use  the  Support  Vector  Machine  (SVM),  a 
pattern  classification  algorithm  recently  developed 
by V. Vapnik and his team at AT&T Bell Labs
[4, 6, 9, 32]. SVM can be seen as a new way to train
polynomial,  neural  network,  or  Radial  Basis  Func¬ 
tions  classifiers.  While  most  of  the  techniques  used 
to  train  the  above  mentioned  classifiers  are  based  on 
the  idea  of  minimizing  the  training  error,  which  is 
usually  called  empirical  risk,  SVMs  operate  on  an¬ 
other  induction  principle,  called  structural  risk  min¬ 
imization,  which  minimizes  an  upper  bound  on  the 
generalization error. From the implementation point
of view, training an SVM is equivalent to solving a
linearly constrained Quadratic Programming (QP)
problem in a number of variables equal to the number
of data points. This problem is challenging when
the size of the data set exceeds a few
thousand. Osuna et al. [17] have shown that a
large-scale QP problem of the type posed by SVM
can be solved by a decomposition algorithm: the
original problem is replaced by a sequence of smaller
problems that is proved to converge to the global
optimum. The applicability of this approach was
demonstrated  by  using  SVM  as  the  core  classifica¬ 
tion  algorithm  in  a  real-time  face  detection  system. 

Here  we  briefly  sketch  the  SVM  algorithm  and  its 
motivation.  A  more  detailed  description  of  SVM 
can  be  found  in  [32]  (chapter  5)  and  [9]. 

We start from the simple case of two linearly separable
classes. We assume that we have a data set
$D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{\ell}$ of labeled
examples, where $y_i \in \{-1, 1\}$, and we wish to
determine, among the infinite number of linear classifiers
that separate the data, which one will have the smallest
generalization error. Intuitively, a good choice is the
hyperplane that leaves the maximum margin between the two
classes, where the margin is defined as the sum of the
distances of the hyperplane from the closest point of
each of the two classes.

If the two classes are non-separable we can still look
for the hyperplane that maximizes the margin and that
minimizes a quantity proportional to the number of
misclassification errors. The trade-off between margin
and misclassification error is controlled by a positive
constant $C$ that has to be chosen beforehand. In this
case it can be shown that the solution to this problem is
a linear classifier
$f(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^{\ell} \lambda_i y_i \mathbf{x}^T \mathbf{x}_i + b\right)$
whose coefficients $\lambda_i$ are the solution of the
following QP problem:

$$
\begin{aligned}
\underset{\Lambda}{\text{Minimize}} \quad & W(\Lambda) = -\Lambda^T \mathbf{1} + \tfrac{1}{2}\, \Lambda^T D \Lambda \\
\text{subject to} \quad & \Lambda^T \mathbf{y} = 0 \\
& \Lambda - C\mathbf{1} \leq 0 \\
& -\Lambda \leq 0
\end{aligned}
\tag{1}
$$

where $(\Lambda)_i = \lambda_i$, $(\mathbf{1})_i = 1$ and
$D_{ij} = y_i y_j \mathbf{x}_i^T \mathbf{x}_j$. It turns
out that only a small number of coefficients $\lambda_i$
are different from zero, and since every coefficient
corresponds to a particular data point, this means that
the solution is determined by the data points associated
with the non-zero coefficients. These data points, which
are called support vectors, are the only ones relevant to
the solution of the problem: all the other data points
could be deleted from the data set and the same solution
would be obtained. Intuitively, the support vectors are
the data points that lie at the border between the two
classes. Their number is usually small, and Vapnik showed
that it is proportional to the generalization error of
the classifier.

Since it is unlikely that any real-life problem can
actually be solved by a linear classifier, the technique
has to be extended in order to allow for non-linear
decision surfaces. This is easily done by projecting the
original set of variables $\mathbf{x}$ into a higher
dimensional feature space:
$\mathbf{x} \in R^d \Rightarrow \mathbf{z}(\mathbf{x}) = (\phi_1(\mathbf{x}), \ldots, \phi_n(\mathbf{x})) \in R^n$
and by formulating the linear classification problem in
the feature space. The solution will have the form
$f(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^{\ell} \lambda_i y_i \mathbf{z}^T(\mathbf{x}) \mathbf{z}(\mathbf{x}_i) + b\right)$,
and therefore will be nonlinear in the original input
variables. One has to face at this point two problems:
1) the choice of the features $\phi_i(\mathbf{x})$, which
should be done in a way that leads to a “rich” class of
decision surfaces; 2) the computation of the scalar
product $\mathbf{z}^T(\mathbf{x}) \mathbf{z}(\mathbf{x}_i)$,
which can be computationally prohibitive if the number of
features $n$ is very large (for example, in the case in
which one wants the feature space to span the set of
polynomials in $d$ variables, the number of features $n$
is exponential in $d$). A possible solution to these
problems consists in letting $n$ go to infinity and
making the following choice:

$$
\mathbf{z}(\mathbf{x}) = \left(\sqrt{\alpha_1}\,\psi_1(\mathbf{x}), \ldots, \sqrt{\alpha_i}\,\psi_i(\mathbf{x}), \ldots\right)
$$

where $\alpha_i$ and $\psi_i$ are the eigenvalues and
eigenfunctions of an integral operator whose kernel
$K(\mathbf{x}, \mathbf{y})$ is a positive definite
symmetric function. With this choice the scalar product
in the feature space becomes particularly simple because:

$$
\mathbf{z}^T(\mathbf{x})\,\mathbf{z}(\mathbf{y}) = \sum_{i=1}^{\infty} \alpha_i \psi_i(\mathbf{x}) \psi_i(\mathbf{y}) = K(\mathbf{x}, \mathbf{y})
$$

where the last equality comes from the
Mercer-Hilbert-Schmidt theorem for positive definite
functions. The QP problem that has to be solved now is
exactly the same as in eq. (1), with the exception that
the matrix $D$ now has elements
$D_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$. As a
result of this choice, the SVM classifier has the form
$f(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^{\ell} \lambda_i y_i K(\mathbf{x}, \mathbf{x}_i) + b\right)$.
In Table 1 we list some choices of the kernel function
proposed by Vapnik: notice how they lead to well known
classifiers, whose decision surfaces are known to have
good approximation properties.
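
To make the kernel form of the classifier concrete, the sketch below evaluates the decision function for the three kernels of Table 1. It does not solve the QP problem: the support vectors, the coefficients and the threshold $b$ are supplied by hand, and all function names are hypothetical. The last function spells out an explicit feature map for the degree-2 polynomial kernel in one dimension, so that the identity between the kernel and the scalar product in feature space can be checked numerically.

```python
import math


def dot(u, v):
    """Ordinary scalar product; also serves as the linear kernel."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(x, xi):
    """Gaussian RBF kernel, K(x, xi) = exp(-||x - xi||^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)))


def polynomial_kernel(d):
    """Polynomial kernel of degree d, K(x, xi) = (x.xi + 1)^d."""
    return lambda x, xi: (dot(x, xi) + 1) ** d


def mlp_kernel(theta):
    """Multilayer-perceptron kernel, K(x, xi) = tanh(x.xi - theta)."""
    return lambda x, xi: math.tanh(dot(x, xi) - theta)


def svm_decision(x, support_vectors, labels, lambdas, b, kernel):
    """f(x) = sign(sum_i lambda_i * y_i * K(x, x_i) + b), with sign(0) taken as +1."""
    s = sum(lam * y * kernel(x, sv)
            for sv, y, lam in zip(support_vectors, labels, lambdas))
    return 1 if s + b >= 0 else -1


def explicit_features_deg2(x):
    """Feature map z(x) with z(x).z(y) = (x*y + 1)^2 for one-dimensional x."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0], 1.0)


# Hand-set toy classifier (not trained): one support vector per class.
support_vectors = [(1.0,), (-1.0,)]
labels = [1, -1]
lambdas = [0.5, 0.5]  # satisfies the constraint sum_i lambda_i * y_i = 0
b = 0.0

label_rbf = svm_decision((2.0,), support_vectors, labels, lambdas, b, rbf_kernel)
label_lin = svm_decision((-3.0,), support_vectors, labels, lambdas, b, dot)
```

With the linear kernel these two support vectors reproduce the maximum-margin classifier $f(x) = \mathrm{sign}(x)$; swapping the kernel changes the decision surface without changing the form of the classifier.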

It  is  worth  noting  that  Haar  wavelets  are  not  the 
only choice for an appropriate dictionary. Smoother
wavelets are very similar and may be more appropriate.
Additional dictionaries of other basis functions
(like  Fourier  basis)  may  be  added  to  a  basic  dictio¬ 
nary  of  wavelets  and  may  be  especially  desirable  for 
certain  tasks. 

Our  object  detection  framework  at  present  does  not 
use  contextual  information.  However,  as  work  by 
Lipson  has  shown,  the  qualitative  template  frame¬ 
work  provides  a  convenient  way  to  incorporate  con¬ 
textual  knowledge.  We  are  currently  investigating 
whether  the  algorithms  used  by  Lipson  could  be  re¬ 
formulated  in  terms  of  an  overcomplete  but  fixed 
dictionary  of  basis  functions  such  as  wavelets. 

3  The  verification  stage:  Morphable 
models  of  object  classes 

The  detection  scheme  described  so  far  finds  small 
regions  in  the  image  that  are  likely  to  contain  the 
object  sought.  A  verification  stage  is  desirable  to 
reject  false  alarms.  For  this  goal  we  plan  to  use  a 
newly  developed  model  of  classes  of  objects  (such  as 
faces)  and  to  match  this  model  to  the  image  window 
of  interest. 

Our approach is based on modeling object classes as
a linear combination of prototype images. The
motivation for this model comes from the linear class
concept [33]. That work showed that linear
transformations can be learned exactly from a small set
of examples in the case of linear object classes. Since
many object transformations can be approximated
by linear transformations, this is a very useful
result. Practical applications that motivate this work
include man-machine interfaces, image analysis, video
teleconferencing, image compression and object
verification.

Cootes and Taylor [8, 7] proposed a similar model for
object classes. They also use a linear combination
of prototypes as an object model. Their work differs
from ours in the algorithm they used to fit a model
to an image, and in the use of a sparse set of control
points as opposed to a full pixelwise correspondence
matrix for their models. The work of Ullman and
Basri [31] and Shashua [24] provided strong motivation
for this work. They showed that any view of a single
object can be represented as a linear combination of
just three views of the object. Our work is concerned
with extending this idea to object classes.

Kernel Function                                                            Type of Classifier
-------------------------------------------------------------------------  --------------------
$K(\mathbf{x}, \mathbf{x}_i) = \exp(-\|\mathbf{x} - \mathbf{x}_i\|^2)$     Gaussian RBF
$K(\mathbf{x}, \mathbf{x}_i) = (\mathbf{x}^T \mathbf{x}_i + 1)^d$          Polynomial, degree d
$K(\mathbf{x}, \mathbf{x}_i) = \tanh(\mathbf{x}^T \mathbf{x}_i - \Theta)$  Multilayer Perceptron

Table 1: Some possible kernel functions and the type
of decision surface they define

Our  research  focused  on  two  main  problems:  model¬ 
ing  object  classes  and  matching  the  model  to  a  novel 
image.  Our  approach  to  the  first  problem  is  to  model 
object  classes  as  a  linear  combination  of  example  im¬ 
ages  called  prototypes.  Linearly  combining  images 
means  treating  the  images  as  vectors.  In  order  to  lin¬ 
early  combine  these  vectors,  we  must  have  pixelwise 
correspondence  between  the  images.  Otherwise,  the 
linear  combination  would  not  form  a  vector  space. 
So,  in  addition  to  the  prototype  images  as  input,  we 
also  require  the  pixelwise  correspondences  as  input. 
The correspondences are represented as flow fields
between one prototype, chosen as the reference prototype,
and each of the other prototypes. Given this, our model
for object classes consists of two components, namely
shape and texture vectors. The shape
of  a  model  image  is  a  linear  combination  of  proto¬ 
type  shapes  (flow  fields).  Analogously,  the  texture  of 
a  model  image  is  a  linear  combination  of  prototype 
textures  (grey  level  images).  The  model  thus  con¬ 
tains  two  parameters  per  prototype:  one  for  shape 
and  one  for  texture.  Adjusting  these  parameters  al¬ 
lows  one  to  create  a  large  variety  of  model  images 
by  morphing  the  reference  image  according  to  the 
linear  combination  of  shape  and  texture  specified  by 
the  model  parameters. 
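
The two-part model can be illustrated on one-dimensional "images". The sketch below is a deliberately crude stand-in, not the model of the papers cited in this section: the function names are hypothetical, the prototypes are invented, and the warp is a nearest-pixel forward mapping that leaves holes at zero, whereas a real implementation would interpolate.

```python
def linear_combination(coeffs, vectors):
    """Return sum_i coeffs[i] * vectors[i], computed element by element."""
    n = len(vectors[0])
    return [sum(c * v[k] for c, v in zip(coeffs, vectors)) for k in range(n)]


def render_model(shape_coeffs, texture_coeffs, proto_shapes, proto_textures):
    """Morph the reference frame according to the model parameters.

    proto_shapes:   flow fields (displacement of each reference pixel).
    proto_textures: grey levels, each expressed in the reference frame.
    """
    shape = linear_combination(shape_coeffs, proto_shapes)
    texture = linear_combination(texture_coeffs, proto_textures)
    n = len(texture)
    image = [0.0] * n
    for x in range(n):
        target = int(round(x + shape[x]))  # where reference pixel x lands
        if 0 <= target < n:
            image[target] = texture[x]
    return image


# Two invented prototypes. The first is the reference (zero flow);
# the second is shifted right by one pixel and twice as bright.
proto_shapes = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1]]
proto_textures = [[0, 10, 20, 10, 0], [0, 20, 40, 20, 0]]

image = render_model([0.0, 1.0], [0.5, 0.5], proto_shapes, proto_textures)
```

Here the shape parameters select the full one-pixel shift of the second prototype while the texture parameters average the two grey-level patterns, so shape and texture are controlled independently.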

Our  approach  to  the  problem  of  matching  a  model  to 
a  novel  image  is  the  following.  We  first  define  an  er¬ 
ror  measure  between  the  input  novel  image  and  the 
current  guess  for  the  best  fitting  model  image.  This 
error  is  simply  the  L2  error  between  the  pixels  (grey 
level  values)  of  the  novel  and  model  images.  To  ob¬ 
tain  the  model  image,  we  must  morph  the  reference 
image  according  to  the  model  parameters.  This  pro¬ 
duces  an  image  against  which  we  can  compare  the 
novel  input  image.  Given  this  error  function  we  find 
the  best  matching  model  image  by  optimizing  the 
error  with  respect  to  the  model  parameters  using  a 
stochastic  gradient  descent  algorithm. 
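
The optimization can be sketched for a simplified case. The code below assumes that the shape parameters are held fixed, so that only the texture coefficients are fitted and the model image is a plain linear combination of prototype textures; the function name, the learning rate, the batch size and the toy data are all assumptions. At each step the gradient of the L2 error is estimated from a small random subset of pixels rather than from the whole image, which is what makes the descent stochastic.

```python
import random


def fit_texture_coefficients(novel, proto_textures, steps=2000, batch=2,
                             lr=0.01, seed=0):
    """Minimize sum_x (novel[x] - sum_j b_j * T_j[x])^2 over the coefficients b.

    Each iteration samples `batch` random pixels and takes one gradient step
    on the squared error restricted to those pixels.
    """
    rng = random.Random(seed)
    n_pixels = len(novel)
    coeffs = [0.0] * len(proto_textures)
    for _ in range(steps):
        grad = [0.0] * len(coeffs)
        for x in (rng.randrange(n_pixels) for _ in range(batch)):
            model_x = sum(c * t[x] for c, t in zip(coeffs, proto_textures))
            residual = novel[x] - model_x
            for j, t in enumerate(proto_textures):
                grad[j] -= 2.0 * residual * t[x]  # d(residual^2) / d b_j
        coeffs = [c - lr * g / batch for c, g in zip(coeffs, grad)]
    return coeffs


# Invented prototypes; the novel image is 0.3 * first + 0.7 * second.
proto_textures = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
novel = [0.3 * a + 0.7 * b for a, b in zip(*proto_textures)]

fitted = fit_texture_coefficients(novel, proto_textures)
```

Fitting the shape parameters as well would require differentiating the error through the warp, which this sketch leaves out.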

The  model  and  matching  algorithm  are  described 
more  fully  in  two  papers  in  this  volume  (Jones  and 
Poggio;  Jones,  Vetter  and  Poggio). 

4  An  application:  an  Intelligent  Web 
Crawler 


The problem of searching for information on the
World-Wide Web has become critically important with the
Web’s enormous growth in recent years. To address this
problem, new search engines and site indices have been
developed (e.g., Alta-Vista, Yahoo, Excite), and they
have become an essential part of the Web. However, these
search engines are based mainly on textual analysis
(e.g., word statistics) of the site content or on
categorical classification (e.g., business, sports,
magazines, etc.). An important aspect of the Web is the
inclusion of other media forms such as images and video.
These media are not utilized for the task of information
search. To take
advantage  of  the  visual  content  of  the  various  Web 
sites,  we  have  developed  a  search  engine  that  is  based 
on  computer  vision  techniques  which  are  applied  to 
images  collected  on  the  web  during  the  search.  Such 
a  search  engine  enables  the  user  to  search  for  sites 
that  contain  images  of  certain  types  (e.g.  images 
of  people  or  cars)  or  even  to  search  for  a  particular 
person  (e.g.  Clinton). 

4.1  System  Architecture 

The  architecture  of  the  search  engine  -  as  designed 
by  Michael  Oren  and  Jason  Miller  -  is  based  on  the 
Web-crawler  scheme.  The  client  (the  user)  sends  to 
the  server  (the  search  engine)  a  URL  address  that 
is  the  starting  point  of  the  search.  The  web  crawler 
visits  the  given  address,  locates  all  the  images,  and 
retrieves  them  to  the  server  site.  The  server  converts 
the  images  to  a  standard  format  and  sends  them  to 
the image analysis module, which provides information on
the visual content of the image (e.g., does the image
contain a frontal face?). Based on the information that
the server gets from the image analysis module, it
formats a result page and sends it back to the client.
The server also identifies all the links
to  other  sites  (URL  addresses)  that  follow  from  the 
current  site  and  continues  the  search. 
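
The control flow of such a crawler can be sketched as below. This is not the authors' Java implementation: the names `crawl`, `fetch_page` and `analyze_image` are hypothetical, the image analysis module is replaced by a placeholder test on the image URL, and a small in-memory dictionary stands in for the Web so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageParser(HTMLParser):
    """Collect the href of every <a> tag and the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def crawl(start_url, fetch_page, analyze_image, max_pages=10):
    """Breadth-first crawl from start_url.

    Returns {page URL: [image URLs accepted by analyze_image]}.
    fetch_page(url) returns the page HTML, or None if it cannot be retrieved.
    """
    queue = deque([start_url])
    seen = {start_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        if html is None:
            continue
        parser = LinkAndImageParser()
        parser.feed(html)
        images = [urljoin(url, src) for src in parser.images]
        results[url] = [img for img in images if analyze_image(img)]
        for href in parser.links:  # follow links to continue the search
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


# In-memory stand-in for the Web (invented pages).
FAKE_WEB = {
    "http://example.org/": '<a href="/people.html">people</a><img src="logo.gif">',
    "http://example.org/people.html": '<img src="face1.jpg"><a href="/">home</a>',
}

found = crawl(
    "http://example.org/",
    fetch_page=FAKE_WEB.get,
    analyze_image=lambda url: "face" in url,  # placeholder for a face detector
)
```

Keeping the image analysis behind a single callback mirrors the separation between the crawler and the image analysis module described above, so a different detector can be plugged in without touching the crawl loop.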

4.2  System  Implementation 

The  web  crawler  engine  was  implemented  using  the 
Java programming language. The image analysis module was
implemented in C/C++ since it is computationally
intensive. Currently we are using a routine for face
detection in a cluttered scene [28] [17]
that  determines  whether  or  not  the  image  contains 
frontal  faces  and  locates  their  positions. 

The web crawler can be used not only as a search
engine but also as a pre-loader that downloads
pre-selected information from the Web to the local disk.
Pre-loading is especially useful for users with
low-bandwidth Internet access, and when viewing images,
which are slow to transfer on account of their large
size. Unlike current commercial pre-loader software
packages, which are limited to certain pre-defined sites,
the vision-based web crawler can search new sites and
download images based on their content.

We are working on adding a face recognition module which,
together with the face detection routine, will enable the
crawler to search for a specific person’s image on the
Web. We also plan to augment the web crawler with object
detection and scene classification routines currently
under development.

5  Cog:  A  Humanoid  Robot 

Although  a  primary  focus  of  our  collaborative  work 
is  in  establishing  a  framework  for  designing  search 
engines, we are also examining the utility, robustness
and flexibility of our components in a very different
framework. This involves Cog, a ten degree of freedom
upper torso humanoid robot with a fully mobile head/eye
system, coupled with a six degree of freedom
series-elastic actuated arm. Cog serves
as  an  excellent  platform  on  which  to  design,  explore 
and  evaluate  visual  modules,  since  the  real-time  per¬ 
formance  capabilities  of  the  robot  provide  a  unique 
environment  in  which  to  test  system  designs.  Many 
of  the  visual  modules  used  in  our  visual  search  en¬ 
gines,  especially  low  level  feature  extractors,  focus  of 
attention  methods  and  tracking  methods,  have  also 
been  incorporated  into  Cog  and  tested  in  real  time 
visual  environments. 

Further,  Cog  has  served  as  a  focal  point  for  exam¬ 
ining  the  trainability  of  different  visual  components. 
Examples  include  automatically  training  Cog  to  re¬ 
late  topographic  maps  from  different  modalities  (e.g. 
oculomotor,  visual  tracking,  auditory),  to  use  such 
maps  to  orient  to  visual  stimuli,  to  attend  to  distinc¬ 
tive  visual  stimuli,  to  learn  ego-motion  relationships, 
to  locate  and  track  faces,  to  learn  hand-eye  coordina¬ 
tion  and  visually-guided  pointing,  and  to  coordinate 
stereo  information  with  other  perceptual  cues. 

We  expect  that  as  other  components  of  our  work  in 
scene  and  object  detection  mature,  they  will  also  be 
incorporated into the growing suite of visual techniques
available to our humanoid robot.

Abstract

This  report  summarizes  research  being  con¬ 
ducted  under  the  DoD  MURI  Program 
by  the  University  of  Maryland,  the  MIT 
Media  Laboratory,  and  the  University  of 
Washington.  Areas  being  investigated  in¬ 
clude  motion  modeling;  human  motion 
analysis;  person  tracking;  face  tracking  and 
interpretation;  target  recognition;  vehicle 
detection and tracking; self-tuning of IU algorithms;
multi-scale discriminant analysis
and  recognition;  robust  estimation  of  opti¬ 
cal  flow;  deformable  contours;  geometry  of 
plane  curves;  model-based  curve  recogni¬ 
tion;  3D  curve  reconstruction;  and  texture 
discrimination  and  description. 

1  Introduction 

The  Computer  Vision  Laboratory  of  the  Center  for 
Automation  Research  (CfAR)  at  the  University  of 
Maryland  is  conducting  research  on  many  aspects 
of  image  understanding.  This  paper  deals  primarily 
with the research being conducted under Contract
N00014-95-1-0521, entitled “Appearance-Based Vision for
Complex Environments”, which was funded
under  the  Department  of  Defense’s  Multidisciplinary 
University  Research  Initiative  (MURI)  Program. 
The  agent  for  this  effort  is  the  Office  of  Naval  Re¬ 
search;  the  COTR  is  Dr.  Harold  L.  Hawkins. 

The  MURI  project  at  CfAR  deals  with  video  surveil¬ 
lance  and  tracking,  with  emphasis  on  complex  scenes 
that contain humans or vehicles. Co-principal
investigators on this project at CfAR are Rama Chellappa
and Larry S. Davis. Also participating in the research
are two subcontractors: the MIT Media Laboratory,
Cambridge, MA (Principal investigators: Alex P. Pentland
and Rosalind W. Picard); and the Department of Electrical
Engineering at the University of Washington, Seattle, WA
(Principal investigator: Robert M. Haralick). The
following sections of this report briefly summarize the
research being conducted on the project. Specific aspects
of this research are described in greater detail in four
separate papers in these Proceedings, as referenced
below. The URL for the project web page is
http://www.cfar.umd.edu/cvl/MURI/. 

2 Motion Modeling [Yacoob and
Davis, 1997a, 1997b; Black et al.,
1997a; Black et al., in preparation]

A  model  has  been  developed  for  computing  im¬ 
age  flow  in  image  sequences  containing  a  very  wide 
range  of  instantaneous  flows.  This  model  integrates 
the  spatio-temporal  image  derivatives  from  multiple 
temporal  scales  to  provide  both  reliable  and  accurate 
instantaneous  flow  estimates.  The  integration  em¬ 
ploys  robust  regression  and  automatic  scale  weight¬ 
ing  in  a  generalized  brightness  constancy  framework. 
In  addition  to  instantaneous  flow  estimation  the 
model  supports  recovery  of  dense  estimates  of  im¬ 
age  acceleration  and  can  be  readily  combined  with 
parameterized  flow  and  acceleration  models.  Its  per¬ 
formance  has  been  demonstrated  on  image  sequences 
of  typical  human  actions  taken  with  a  high  frame- 
rate  camera.  Further  details  about  the  model  can 
be  found  in  a  separate  paper  in  these  Proceedings. 

As  non-rigid  and  non-cohesive  objects  move  or 
change  state,  their  projected  appearance  onto  the 
image  plane  of  the  camera  changes.  These  changes  in 
appearance  due  to  deforming,  articulating  or  erup¬ 
tive/emergent  events  can  be  accounted  for  by  two 
classes  of  image-based  appearance  events.  The  first 
class  accounts  for  the  deformations  and  articulations 
of  the  object  in  the  image  plane;  we  describe  such 
form  changes  as  “warps”  of  the  object  appearance 
in  the  image  plane.  The  second  class  accounts  for 
intensity changes resulting from occlusion, disocclusion,
and changes in material properties of the object. We
model these intensity variations by means of intensity
templates and refer to them as iconic changes. The
evolving appearance of a non-rigid and non-cohesive
object is then modeled using a combination of form
changes and iconic changes. We
are  developing  a  method  for  learning  parameterized 
models  of  form  changes  and  iconic  changes,  and  an 
algorithm  for  recovering  them  from  image  sequences. 
These  changes  in  image  appearance  can  be  used  for 
recognition  of  various  object  deformations;  we  have 
illustrated  this  with  examples  of  non-rigid  and  non- 
cohesive  human  mouth  motion  in  natural  speech. 

3 Human Motion Analysis [Black et
al., 1997b; Horprasert et al., 1997; Ju et
al., 1996; Yacoob et al., to appear]

We  have  developed  a  representation  in  which  we  ap¬ 
proximate  the  non-rigid  motion  of  a  person  using  a 
set of parameterized models of optical flow. While
parameterized flow models (for example, affine flow)
have been used for representing image motion in
rigid scenes, Black and Yacoob observed that simple
parameterized models can well approximate more
complex  motions  if  they  are  localized  in  space  and 
time.  Moreover,  they  showed  that  the  motion  of  one 
body  region  (for  example,  the  face  region)  could  be 
used  to  stabilize  that  body  part  in  a  warped  image 
sequence.  This  allowed  the  image  motions  of  facial 
features  (the  eyes,  mouth,  and  eyebrows)  to  be  esti¬ 
mated  relative  to  the  motion  of  the  face.  Isolating 
the  motions  of  these  features  from  the  motion  of  the 
face  is  critical  for  recognizing  facial  expressions  using 
motion. 

These  parameterized  motion  models  can  be  ex¬ 
tended  to  model  the  articulated  motion  of  the  hu¬ 
man  limbs.  Limb  segments  can  be  approximated  by 
planes  and  the  motion  of  these  planes  can  be  re¬ 
covered  using  a  simple  eight-parameter  optical  flow 
model.  Constraints  can  be  added  to  the  optical  flow 
estimation  problem  to  model  the  articulation  of  the 
limbs  and  the  relative  image  motions  of  the  limbs  can 
be  used  for  recognition.  We  have  defined  a  “card¬ 
board  person  model”  in  which  a  person’s  limbs  are 
represented  by  a  set  of  connected  planar  patches. 
The  parameterized  image  motion  of  these  patches 
is  constrained  to  enforce  articulated  motion  and  is 
solved  for  directly  using  a  robust  estimation  tech¬ 
nique.  The  recovered  motion  parameters  provide 
a  rich  and  concise  description  of  the  activity  that 
can  be  used  for  recognition.  We  are  developing  a 
method  for  performing  view-based  recognition  of  hu¬ 
man  activities  from  the  optical  flow  parameters  that 
extends  previous  methods  to  cope  with  the  cyclical 
nature  of  human  motion.  We  have  illustrated  our 
method  with  examples  of  tracking  human  legs  over 
long  image  sequences. 

We are also developing a method of estimating 3D head
orientation in a monocular image sequence. The method
employs recently developed image-based parameterized
tracking methods for the face and face features to locate
the area in which sub-pixel parameterized shape
estimation of the eye’s
boundary  is  performed.  This  involves  tracking  of  five 
points  (four  at  the  eye  corners;  the  fifth  is  the  tip  of 
the  nose).  Our  approach  relies  on  the  coarse  struc¬ 
ture  of  the  face  to  compute  orientation  relative  to 
the  camera  plane.  It  employs  projective  invariance 
of  the  cross-ratios  of  the  eye  corners  and  anthropo¬ 
metric  statistics  to  estimate  the  head  yaw,  roll  and 
pitch. 
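
The invariance that this method exploits can be checked numerically. The text does not say how yaw, roll and pitch are derived from the cross-ratios, so the sketch below demonstrates only the invariance itself; the function names, the corner coordinates and the parameters of the projective map are assumptions.

```python
def cross_ratio(a, b, c, d):
    """Cross-ratio (a, b; c, d) of four collinear points given by 1-D coordinates."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))


def projective_map(x, p=2.0, q=1.0, r=0.1, s=1.0):
    """One-dimensional projective transformation x -> (p*x + q) / (r*x + s).

    Requires p*s - q*r != 0; the default values are arbitrary.
    """
    return (p * x + q) / (r * x + s)


# Invented positions of the four eye corners along the line joining them.
corners = [0.0, 3.0, 6.5, 9.5]

before = cross_ratio(*corners)
after = cross_ratio(*[projective_map(x) for x in corners])
```

Because the value is unchanged by the projection, a departure of the measured cross-ratio from the value expected for a frontal face carries information about the head orientation rather than about the camera's projection of the line.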

4 Real-Time Person Tracking
(M.I.T. Media Laboratory)
[Azarbayejani and Pentland, 1996;
Pentland et al., 1997]

We  have  developed  an  estimation  technique  for  re¬ 
covering  3-D  object  shape  and  motion  and  multiple- 
camera  geometry  from  2-D  features.  The  3-D  ob¬ 
jects  and  2-D  features  are  both  represented  using 
moment-based  physical  models  called  blobs.  Non¬ 
linear  optimization  techniques  are  used  for  estima¬ 
tion;  the  Levenberg-Marquardt  technique  is  used  for 
static  parameters  and  the  extended  Kalman  filter  is 
used  for  dynamic  estimation. 

This  estimation  method  has  been  implemented  as 
part of the M.I.T. Media Laboratory’s Smart Spaces
project.  It  is  known  as  the  STIVE  system  (Stereo 
Interactive  Video  Environment),  and  runs  in  real 
time  using  two  SGI  Indigo  computers  with  no  special 
hardware  of  any  kind. 

Experimental  results  show  that  the  STIVE  system 
can  obtain  good  quantitative  3-D  physical  descrip¬ 
tions  from  coarse  2-D  image  observations  of  people. 
We  have  demonstrated  that  this  method  can  be  used 
to  self-calibrate  stereo  cameras  from  watching  peo¬ 
ple  move,  and  subsequently  to  determine  the  loca¬ 
tions,  orientations,  and  shapes  of  parts  of  a  person 
to  an  accuracy  of  2  cm,  2  degrees,  and  a  few  percent, 
respectively  (RMS  errors). 

Perhaps  the  most  important  performance  evalua¬ 
tion,  however,  is  that  the  STIVE  system  has  run 
reliably  for  dozens  of  hours,  with  dozens  of  different 
subjects,  in  several  different  locations,  and  in  real 
time  (20-30  fps)  using  only  standard  workstations. 
The  key  to  the  robust,  real-time  performance  is  that 
the  2-D  blob  features  on  which  the  estimation  relies 
can  be  reliably  and  efficiently  extracted  and  matched 
in  a  bottom-up  fashion. 

We  feel  that  use  of  this  type  of  feature  is  a  signif¬ 
icant  departure  from  the  traditional  notions  of  im¬ 
age  features  (e.g.,  points,  lines)  and  image  cues  (e.g., 
motion fields, shading), and can lead to a basis for
practical 3-D vision systems in application domains
where  traditional  approaches  have  not  had  a  great 
deal  of  success.  Although  the  blob  models  provide 
only  rigid  motion  and  coarse  shape  information,  they 
are  fast  and  extremely  reliable;  thus  further  preci¬ 
sion  and  higher  levels  of  detail,  if  desired,  can  be 
safely  bootstrapped  from  this  level  of  representation, 
potentially  leading  to  a  powerful  “coarse-to-fine”  or 
“subsumption”  approach  to  3-D  shape  and  motion 
analysis. 

More  recent  work  on  real-time  tracking  and  classifi¬ 
cation  of  human  behavior  is  described  in  a  separate 
paper  in  these  Proceedings. 

5 Real-Time Face Tracking and
Interpretation (M.I.T. Media
Laboratory) [Darrel et al., 1996;
Oliver et al., 1996]

We  have  developed  a  real-time  system  for  finding, 
tracking,  and  analyzing  the  human  face  and  mouth. 
The  LAFTER  system  (which  stands  for  Lips  And 
Face  TrackER)  has  two  outputs:  (1)  control  of 
camera  pan/tilt/zoom  to  maintain  a  well-centered 
image  of  appropriate  resolution,  and  (2)  recogni¬ 
tion  of  mouth  shapes  using  hidden  Markov  models 
(HMMs).  The  system  runs  on  a  single  SGI  Indigo 
computer,  and  produces  3-D  estimates  of  head  po¬ 
sition  that  are  surprisingly  accurate.  Classification 
accuracy  for  mouth  shapes  is  consistently  above  95% 
across  all  users. 

Using a single SGI Indigo with a 200 MHz R4400 processor,
the average frame rate for tracking is typically 25 Hz.
When mouth detection and parameter extraction are added
to the face tracking, the average frame rate is 14 Hz.
The RMS error between the
true  3-D  location  and  the  system’s  output  was  on 
the  order  of  1%. 

Our approach to temporal interpretation of facial
expressions uses HMMs to recognize different patterns
of mouth movement. Using mouth shape as the input
feature vector we trained five different HMMs, one for
each of the following mouth configurations: neutral
or default mouth position, extended/smile mouth,
sad mouth, open mouth and extended+open mouth
(such as in laughing).
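
Classification with one HMM per configuration amounts to scoring an observation sequence under every model and picking the most likely one. The sketch below shows this with the forward algorithm on discrete HMMs. It rests on several assumptions that are not in the text: the mouth-shape feature vectors are taken to be quantized into discrete symbols, only two toy models are defined, and all probabilities and names are invented.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete HMM, computed by the forward algorithm.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][k]  : probability of emitting symbol k in state s
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def classify(obs, models):
    """Return the name of the model under which obs is most likely."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Two invented two-state models over symbols 0 (closed-like) and 1 (open-like).
transitions = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "neutral": ([1.0, 0.0], transitions, [[0.9, 0.1], [0.8, 0.2]]),
    "open": ([1.0, 0.0], transitions, [[0.2, 0.8], [0.1, 0.9]]),
}

label_a = classify([1, 1, 1, 0], models)
label_b = classify([0, 0, 0], models)
```

For long sequences the forward variables underflow, so a practical implementation would rescale them at each step or work with log probabilities.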

Recognition  results  for  eight  different  users  making 
over  2000  expressions  were  typically  about  98%  on 
training  and  96%  on  testing.  The  LAFTER  system 
has  also  been  tested  on  hundreds  of  users  at  different 
events,  each  with  its  own  lighting  and  environmen¬ 
tal  conditions;  the  system  performed  well  in  almost 
every  case,  with  most  failures  being  attributable  to 
dense  beards  or  extreme  skin  color. 


6  Target  Recognition  (University  of 
Washington)  [Liu  and  Haralick,  1997] 


There  are  two  difficulties  with  the  classical  use  of 
templates for target recognition; one has to do with
the spatial perturbation of the target in the image
and the other has to do with the fact that the
gray-scale  perturbing  noise  is  non-additive  and  non- 
Gaussian.  Therefore  when  the  template  matching  or 
matched  filtering  is  done,  the  quantity  computed  is 
not  related  to  the  log  probability  of  the  data  given 
the  target.  Our  research  has  concentrated  on  un¬ 
derstanding  how  to  deal  with  the  second  effect,  the 
effect  of  non-additive,  non-Gaussian  perturbation. 

The  idea  behind  the  approach  is  to  use  in  a  tem¬ 
plate  only  those  areas  of  the  target  which  can  effec¬ 
tively  contribute  to  the  detection  and  localization 
of  the  target.  Using  target  areas  which  are  spatially 
non-distinct  and  low-contrast  not  only  does  not  con¬ 
tribute,  but  in  effect  can  negatively  affect  detection 
accuracy and localization. The areas of the target
that are least affected by non-additive, non-Gaussian
noise  are  those  in  which  there  are  relatively  high- 
contrast  spatial  structures.  The  high  contrast  will 
be  the  dominant  effect  relative  to  the  non-Gaussian 
noise  perturbation  and  will  thereby  permit  good  de¬ 
tection  even  with  a  technique  which  assumes  an  ad¬ 
ditive  Gaussian  perturbation.  The  spatial  structure 
will  permit  good  localization  of  the  position  of  the 
detected  target.  Our  work  has  concentrated  on  de¬ 
veloping  a  methodology  to  automatically  locate  such 
areas  on  a  target  so  that  they  can  be  assembled  to¬ 
gether  as  a  template. 

The  technique  relies  on  being  able  to  analyti¬ 
cally  propagate  the  perturbation  of  the  image  data 
through  to  the  covariance  of  the  location  estimated 
for  the  position  of  the  target.  To  determine  which 
additional target areas should be added to the template,
we find the direction θ in which the detected
target location has highest variance using the set of
target areas currently in the template. Then we
find the not-yet-used target area which, when used
for target location, has smallest variance in the
direction θ.
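
The selection step can be sketched as follows. This is only an illustration under stated assumptions: each candidate area is summarized by a hypothetical 2x2 covariance of the location it yields, and estimates from different areas are treated as independent so that their inverse covariances add. How the covariances themselves are propagated from the image perturbation is the subject of (Haralick, 1996) and is not reproduced here.

```python
import numpy as np

def worst_direction(cov):
    """Unit vector along which a 2x2 location covariance is largest."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, -1]

def combined_covariance(covs):
    """Covariance of the fused location, assuming independent estimates."""
    information = sum(np.linalg.inv(c) for c in covs)
    return np.linalg.inv(information)

def next_area(template, candidates):
    """Pick the unused area with smallest variance along the direction in
    which the current template localizes the target worst."""
    current = combined_covariance([candidates[name] for name in template])
    d = worst_direction(current)
    unused = [name for name in candidates if name not in template]
    return min(unused, key=lambda name: d @ candidates[name] @ d)

# Hypothetical per-area location covariances (x and y variances).
areas = {
    "a": np.diag([1.0, 25.0]),   # localizes well in x, poorly in y
    "b": np.diag([16.0, 1.0]),   # localizes well in y
    "c": np.diag([4.0, 4.0]),
}

chosen = next_area(["a"], areas)
```

Starting from area "a", the template is weakest in the y direction, so the area with the smallest y variance is added next.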

The  theoretical  basis  of  the  covariance  propagation 
method  was  published  in  (Haralick,  1996).  This  pa¬ 
per  showed  how  to  compute  the  covariance  of  any 
estimate  obtained  by  either  unconstrained  or  con¬ 
strained  optimization.  It  illustrated  the  application 
of  the  technique  to  a  variety  of  computer  vision  prob¬ 
lems.  A  discussion  of  this  technique  as  it  applies  to 
target  recognition  is  presented  in  a  separate  paper 
in these Proceedings.

7 Vehicle Detection and Tracking

[Betke  et  al.,  1996] 

A  vision  system  has  been  developed  that  recognizes 
and  tracks  multiple  vehicles  from  sequences  of  gray¬ 
scale  images  taken  from  a  moving  car  in  hard  real 
time.  Recognition  is  accomplished  by  combining  the 
analysis  of  single  image  frames  with  the  analysis  of 
the  motion  information  provided  by  multiple  consec¬ 
utive  image  frames.  In  single  image  frames,  cars  are 
recognized  by  matching  deformable  gray-scale  tem¬ 
plates,  by  detecting  image  features,  such  as  corners, 
and  by  evaluating  how  these  features  relate  to  each 
other.  Cars  are  also  recognized  by  differencing  con¬ 
secutive  image  frames  and  by  tracking  motion  pa¬ 
rameters  that  are  typical  for  cars. 

The  vision  system  utilizes  the  hard  real-time  oper¬ 
ating  system  Maruti  which  guarantees  that  the  tim¬ 
ing  constraints  on  the  various  vision  processes  are 
satisfied.  The  dynamic  creation  and  termination  of 
tracking  processes  optimizes  the  amount  of  compu¬ 
tational  resources  spent  and  allows  fast  detection 
and  tracking  of  multiple  cars.  Experimental  results 
demonstrate  robust,  real-time  recognition  and  track¬ 
ing  over  thousands  of  image  frames. 

8 Design of Self-Tuning IU
Algorithms [Shekhar et al., 1997]

Image Understanding (IU) systems used in challenging
operational environments should provide both
flexibility and convenience. Flexibility here means
the ability to accommodate variations in the
characteristics of the input data. Convenience means
that an image analyst unfamiliar with the technical
details of the IU system can obtain satisfactory
results from it.

Flexibility is achieved by providing a number of tuning
parameters, whose optimal setting usually ensures
satisfactory performance. Such performance,
however, is rarely achieved in practice since the
strategies used by the IU system developer to find
the optimal parameter settings are not conveniently
available to the image analyst. The goal of our
research is to reconcile the conflicting goals of
convenience and flexibility by providing a methodology
for automatically achieving optimal parameter
settings under operational conditions. The image
analyst provides input in the form of qualitative
evaluations of the results of IU processing, which are
then interpreted in a rule-based framework to make
the necessary adjustments to the appropriate
algorithms. In this manner the IU system is given the
capacity to tune itself for optimal performance.

In our previous work [Shekhar et al., 1996], we
discussed the knowledge-based semantic integration of
IU algorithms using the OCAPI architecture. In
this architecture, the reasoning of the IU specialist
is formally represented using frames and production
rules. Mechanisms are provided for program
supervision tasks such as algorithm selection and tuning.
We have extended this work to handle more complex
program supervision strategies. We use the LAMA
architecture [Vincent et al., 1996] to implement our
ideas. We are currently testing this approach on the
vehicle detection algorithms developed at the
University of Maryland [Chellappa et al., 1996]. Details
about this work can be found in a separate paper in
these Proceedings.

9  Discriminant  Analysis  and 
Recognition  [Etemad,  1996] 

A  successful  pattern  recognition  scheme  starts  with 
efficient  extraction  of  the  most  discriminant  infor¬ 
mation  elements  from  various,  possibly  imprecise, 
sources,  followed  by  an  intelligent  combination  of 
this  information  in  a  context-dependent  framework 
of  low  complexity. 

Conventional  multiscale  basis  selection  and  feature 
extraction  using  criteria  based  on  compression  or  ap¬ 
proximation  are  not  necessarily  the  best  approaches 
for  classification  and  segmentation  purposes.  In¬ 
stead,  a  class  separability  based  approach  is  prefer¬ 
able.  We  have  developed  methodologies  for  lower¬ 
dimensional  adaptive  multi-scale  discriminant  basis 
selection.  Depending  on  the  task,  these  methodolo¬ 
gies  are  applied  to  local  windows  or  to  the  whole  pat¬ 
tern.  Our  tools  in  this  analysis  are  derived  from  the¬ 
ories  of  wavelet  packets  and  multi-scale  local  bases 
on  the  one  hand,  and  from  the  statistical  theory  of 
discriminant  cluster  analysis  on  the  other  hand.  The 
goal  is  to  find  efficient  multi-scale  representations 
that  yield  maximum  between-class  separations  and 
minimum  within-class  scatters. 
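
One common way to quantify "maximum between-class separation and minimum within-class scatter" is a Fisher-style ratio computed per candidate coefficient. The sketch below ranks features by that ratio; it is an illustration only, since the exact separability criterion used in [Etemad, 1996] is not stated here, and the function name and toy data are hypothetical.

```python
import numpy as np

def fisher_scores(features, labels):
    """Per-feature ratio of between-class to within-class scatter."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in np.unique(labels):
        x = features[labels == c]
        class_mean = x.mean(axis=0)
        between += len(x) * (class_mean - overall_mean) ** 2
        within += ((x - class_mean) ** 2).sum(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero

# Toy data: column 0 separates the two classes, column 1 does not.
X = [[0.0, 1.0], [0.2, 3.0], [1.0, 1.1], [1.2, 2.9]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
best = int(np.argmax(scores))
```

In a basis-selection setting, the columns would be coefficients of candidate multi-scale basis functions, and the highest-scoring ones would be retained.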

We  have  achieved  improved  classification  reliability 
through  context-dependent  integration  of  soft  deci¬ 
sions.  We  have  investigated  the  effectiveness  of  soft 
decisions  in  representing  the  vagueness,  uncertainty 
and  imprecision  of  the  classification  sources.  Based 
on  the  principle  of  least  commitment  in  designing 
pattern  recognition  and  consensus-theoretical  con¬ 
cepts,  we  improve  the  reliability  of  our  classification 
system  through  integration  of  soft  decisions  obtained 
from  various  observations  and/or  sources.  The  com¬ 
bination  of  decisions  is  based  on  the  discrimination 
power  of  each  source  and  its  relevance  to  the  current 
observation.  Our  approach  uses  ideas  from  consen¬ 
sus  theory,  fuzzy  neural  learning,  and  evidential  rea¬ 
soning. 

Our methods of multi-scale local/global basis
selection and context-dependent decision integration
have been applied to several different domains,
including texture and document image classification
and  segmentation,  radar  signature  classification,  and 
human  face  recognition.  The  results  show  that  su¬ 
perior  or  highly  competitive  performance  can  be  ob¬ 
tained  using  small  feature  sets  and  simple  classifiers. 
The  resulting  systems  are  typically  of  low  complex¬ 
ity  and,  since  no  iterative  computations  are  involved, 
most  of  the  calculations  can  be  done  in  parallel.  The 
proposed  ideas  can  be  extended  in  several  directions 
and  can  be  applied  to  many  pattern  recognition  and 
segmentation  tasks. 

10  Optical  Flow  Estimation 

[Srinivasan  and  Chellappa,  1996] 

Computation  of  optical  flow  has  been  formulated  as 
nonlinear  optimization  of  a  cost  function  comprising 
a  gradient  constraint  term  and  a  field  smoothness 
factor.  Results  obtained  using  these  techniques  are 
often  erroneous,  highly  sensitive  to  numerical  preci¬ 
sion,  and  determined  sparsely,  and  they  carry  with 
them  all  the  pitfalls  of  nonlinear  optimization.  We 
have  regularized  the  gradient  constraint  equation  by 
modeling  optical  flow  as  a  linear  combination  of  an 
overlapped  set  of  basis  functions.  We  have  devel¬ 
oped  a  theory  for  estimating  model  parameters  ro¬ 
bustly  and  reliably.  We  prove  that  an  extended  least 
squares  solution  can  be  obtained  that  is  unbiased 
and  robust  to  small  perturbations  in  the  estimates 
of  gradients  and  to  mild  deviations  from  the  gra¬ 
dient  constraint.  The  solution  can  be  obtained  by 
a  numerically  stable  sparse  matrix  inversion,  giving 
a  reliable  flow  field  estimate  over  the  entire  frame. 
Experimental  results  of  our  scheme  are  surprisingly 
accurate  and  consistent  across  a  variety  of  images, 
in  comparison  with  the  standard  optical  flow  algo¬ 
rithms.  We  believe  that  our  flow  field  model  offers 
higher  accuracy  and  robustness  than  conventional 
optical  flow  techniques,  and  is  better  suited  for  im¬ 
age  stabilization,  mosaicking  and  video  compression. 
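
To make the idea of a parametric flow model concrete, the sketch below fits flow coefficients to the gradient constraint I_x u + I_y v + I_t = 0 by ordinary least squares. It is a simplified stand-in: it uses a single global affine basis {1, x, y} rather than an overlapped set of local basis functions, and plain least squares rather than the extended least squares solution described above. All names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_flow(ix, iy, it, basis):
    """Least-squares fit of flow coefficients from the gradient constraint
    ix*u + iy*v + it = 0, where u = basis @ a and v = basis @ b."""
    design = np.hstack([ix[:, None] * basis, iy[:, None] * basis])
    coeffs, *_ = np.linalg.lstsq(design, -it, rcond=None)
    k = basis.shape[1]
    return coeffs[:k], coeffs[k:]

# Synthetic check: gradients generated from a known affine flow field.
rng = np.random.default_rng(0)
x, y = rng.uniform(-1.0, 1.0, (2, 200))
basis = np.column_stack([np.ones_like(x), x, y])
a_true = np.array([1.0, 0.1, 0.0])
b_true = np.array([-0.5, 0.0, 0.2])
ix, iy = rng.normal(size=(2, 200))
it = -(ix * (basis @ a_true) + iy * (basis @ b_true))

a_est, b_est = fit_flow(ix, iy, it, basis)
```

Because the unknowns are the basis coefficients rather than per-pixel flow vectors, the problem is linear and yields a dense flow field once the coefficients are known.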

11  Deformable  Contours  [Gavrila, 
1996] 

We  have  developed  a  Hermite  representation  for  de¬ 
formable  contour  finding.  This  representation  com¬ 
pares  favorably  in  terms  of  versatility  and  controlla¬ 
bility  with  other  local  contour  representations  that 
have  been  used  previously  for  this  purpose.  The  Her¬ 
mite  representation  allows  a  compact  representation 
of  curved  shapes,  without  the  smoothing  out  of  cor¬ 
ners.  It  is  also  well  suited  for  both  interactive  and 
tracking  applications. 

The Hermite representation has been used to formulate
the contour finding problem as an optimization
problem using a maximum a posteriori energy
criterion. Optimization is performed by dynamic
programming. Our approach to contour tracking decouples
the effects of transformation and deformation,
using a template matching strategy to robustly
account for the transformation effect. We have
demonstrated these ideas on a variety of images from
different domains.

12  Differentialless  Geometry  [Latecki 
and  Rosenfeld,  1996] 

We  have  developed  foundations  for  a  theory  of  curves 
and  surfaces  that  does  not  assume  differentiability. 
Initial  results  have  been  obtained  for  plane  curves, 
as  described  below;  extensions  to  space  curves  and 
to  surfaces  are  in  progress.  The  concepts  are  also 
directly  applicable  to  digital  curves  and  surfaces. 

Let S be a subset of the plane. A line l is called
a line of support of S if S lies in one of the two
closed halfplanes defined by l. We call S supported
if there exists a line of support of S through every
point of S. A set S is supported iff it is contained
in the boundary of its closed convex hull; hence a
closed, bounded, connected supported set must be
an arc (or simple closed curve). A supported arc
S has (finite, non-zero) one-sided derivatives at
every point, and is differentiable at a point P iff the
line of support of S at P is unique. (Note that the
fact  that  S  is  an  arc,  and  its  differentiability  prop¬ 
erties,  were  not  assumed;  they  are  consequences  of 
supportedness.)  If  S  has  a  unique  line  of  support  at 
every  non-endpoint,  its  curvature  is  defined  at  every 
non-endpoint  and  has  constant  sign,  and  its  total 
absolute  turn  is  at  most  360°. 
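
For a finite (for example, digital) point set, the definition of supportedness can be checked directly. The sketch below is a brute-force illustration with hypothetical function names; it relies on the observation that, for a finite set, if any line of support passes through a point p, then one also passes through p and some other point of the set, so only those candidate lines need testing. Integer coordinates keep the sign tests exact.

```python
def has_support_line(p, points):
    """True if some line through p leaves all points in one closed halfplane."""
    others = [q for q in points if q != p]
    if not others:
        return True
    for q in others:
        # Signed area test: which side of the line p->q each point r lies on.
        sides = [(q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
                 for r in points]
        if all(s >= 0 for s in sides) or all(s <= 0 for s in sides):
            return True
    return False

def is_supported(points):
    """True if every point of the finite set has a line of support."""
    return all(has_support_line(p, points) for p in points)

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
with_edge_point = square + [(1, 0)]      # lies on the hull boundary
with_interior_point = square + [(1, 1)]  # lies strictly inside the hull
```

Here `is_supported` agrees with the convex-hull characterization: a set passes exactly when none of its points lies strictly inside the hull.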

An  arc  (not  necessarily  simple)  is  called  tame  if  it  is 
the  concatenation  of  a  finite  set  of  supported  arcs; 
for  example,  a  polygonal  arc  is  tame.  A  nonendpoint 
of  a  tame  arc  which  is  not  interior  to  any  supported 
subarc is called an inflection; if the arc is differentiable,
its curvature must change sign at such a point.
A  tame  arc  can  have  only  finitely  many  inflections, 
and  its  total  absolute  turn  must  be  finite. 

13 Curve Recognition and
Reconstruction [Weiss, 1995; Weiss,
1996]

It  is  well  known  there  are  no  geometric  invariants  of 
a  projection  from  3D  to  2D.  However,  given  some 
modeling  assumptions  about  the  3D  object,  such  in¬ 
variants  can  be  found.  The  modeling  assumptions 
should  be  sufficiently  strong  to  enable  us  to  find  such 
invariants,  but  not  stronger  than  necessary.  We  have 
developed  such  modeling  assumptions  for  general  3D 
curves  under  affine  projection.  We  show  that  if  we 
know one of the two affine-invariant curvatures at
each point of the curve, we can derive the other one
from  its  image.  We  can  also  derive  the  point  corre¬ 
spondence  between  the  curve  and  the  image. 

There  has  been  considerable  work  recently  on  the 
problem  of  reconstruction  of  3D  point  sets  from  two 
images,  taken  by  uncalibrated  cameras.  However, 
the  point  correspondence  has  to  be  given.  When 
we  deal  with  reconstruction  of  curves  rather  than 
points,  while  we  need  the  correspondence  between 
curves,  this  is  an  easier  problem  because  curves  are 
far  fewer  and  more  distinctive  than  points.  We  have 
derived  a  simple  and  general  reconstruction  method, 
based  on  an  invariant  coordinate  system.  We  have 
applied  it  to  non-coplanar  conics  and  to  combina¬ 
tions  of  a  3D  conic  with  points.  3D  cubics  can  also 
be  handled.  Unlike  previous  work,  we  do  not  need 
to  know  the  epipolar  geometry;  we  recover  it  from 
the  images. 

14  Texture  Segmentation  and 
Description 

[Ojala  and  Pietikainen,  1996;  Ojala, 
1996] 

A  method  has  been  developed  for  unsupervised  tex¬ 
ture  segmentation  utilizing  a  statistical  test  based 
on  distributions  of  feature  values  to  compare  neigh¬ 
boring  image  regions.  The  distributions  of  simple  lo¬ 
cal  binary  patterns  combined  with  a  complementary 
contrast measure are used as texture features and a
log-likelihood  test,  the  G  test,  is  used  for  comparing 
feature  distributions.  A  robust  split-and-merge  type 
algorithm  is  developed  for  coarse  image  segmenta¬ 
tion.  A  pixelwise  classification  scheme  with  feature 
distributions  of  neighboring  regions  as  texture  mod¬ 
els  is  then  used  to  improve  the  localization  at  region 
boundaries.  Our  method  does  not  require  any  prior 
knowledge  about  the  number  of  textures  or  regions 
in  the  image.  In  experiments  the  method  provides 
very  good  segmentations  for  various  types  of  texture 
mosaics  and  natural  scenes.  The  same  set  of  param¬ 
eter  values  is  used  in  all  the  experiments.  General¬ 
izations  of  the  method,  e.g.  to  utilize  other  texture 
features,  multiscale  information,  color  features,  and 
combinations  of  multiple  features,  are  also  possible. 
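
The two ingredients named above, local binary pattern distributions and a G test between them, can be sketched as follows. This is a minimal illustration: it uses the basic 3x3 binary pattern without the complementary contrast measure, and the standard two-sample log-likelihood-ratio (G) statistic on the two histograms, which is assumed rather than confirmed to match the exact form used by the authors. Function names and test images are hypothetical.

```python
import numpy as np

def lbp_histogram(img):
    """256-bin histogram of basic 3x3 local binary patterns (interior pixels)."""
    img = np.asarray(img)
    center = img[1:-1, 1:-1]
    rows, cols = center.shape
    codes = np.zeros(center.shape, dtype=np.int64)
    # Neighbour offsets, clockwise from the top-left corner of the window.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[dy:dy + rows, dx:dx + cols]
        codes += (neighbour >= center).astype(np.int64) << bit
    return np.bincount(codes.ravel(), minlength=256)

def g_statistic(h1, h2):
    """Two-sample G statistic; larger values mean less similar distributions."""
    table = np.array([h1, h2], dtype=float)
    expected = (table.sum(axis=1, keepdims=True)
                * table.sum(axis=0, keepdims=True) / table.sum())
    mask = table > 0  # empty cells contribute nothing to the sum
    return 2.0 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))

vertical = np.tile([0, 1], (8, 4))   # 8x8 image of vertical stripes
horizontal = vertical.T              # same stripes, rotated
h_vertical = lbp_histogram(vertical)
h_horizontal = lbp_histogram(horizontal)
g_same = g_statistic(h_vertical, h_vertical)
g_diff = g_statistic(h_vertical, h_horizontal)
```

In a split-and-merge scheme, a statistic of this kind would be compared against a threshold to decide whether neighboring regions share a texture.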

A  multichannel  approach  to  texture  description  is 
realized  by  approximating  joint  occurrences  of  mul¬ 
tiple  features  with  marginal  distributions,  as  1-D 
histograms,  and  combining  similarity  scores  for  1- 
D  histograms  into  an  aggregate  similarity  score.  A 
stepwise  feature  selection  algorithm  is  used  to  choose 
the  best  feature  combination  in  a  particular  dimen¬ 
sion.  This  approach  gave  excellent  results  in  a  classi¬ 
fication  problem  which  involved  fifteen  fine-grained 
textures  from  the  Brodatz  album.  The  statistical 
test  used  for  measuring  the  similarity  of  sample  and 


prototype  histograms  is  not  crucial  in  terms  of  over¬ 
all  performance;  comparable  classification  results  are 
obtained  with  a  variety  of  tests.  Further,  the  experi¬ 
mental  results  prove  that  choosing  the  proper  quan¬ 
tization  of  the  feature  space  is  relatively  easy. 

15  Bibliography  on  Image  Analysis 
and  Computer  Vision  [Rosenfeld, 
1997] 

We  have  compiled  a  bibliography  of  nearly  2150 
references  related  to  computer  vision  and  image 
analysis,  arranged  by  subject  matter,  which  ap¬ 
peared  in  1996.  The  topics  covered  include  compu¬ 
tational  techniques;  feature  detection  and  segmen¬ 
tation;  image  and  scene  analysis;  two-dimensional 
shape;  pattern;  color  and  texture;  matching  and 
stereo;  three-dimensional  recovery  and  analysis; 
three-dimensional  shape;  and  motion.  A  few  refer¬ 
ences  are  also  given  on  related  topics,  including  ge¬ 
ometry  and  graphics,  compression  and  processing, 
sensors  and  optics,  visual  perception,  neural  net¬ 
works,  artificial  intelligence  and  pattern  recognition, 
as well as on applications.

Abstract

In the battlefield of the future, advanced sensor systems
will likely make the difference between the winners
and the losers, both on the manufacturing and
the military battlefields.

This report is on the Lehigh/Columbia MURI contract,
which is a basic research project with its focus
on  sensors  for  manufacturing.  Its  basic  research  na¬ 
ture  means  that  much  of  it  can  be  applied  for  battle¬ 
field  awareness  or  even  medical  applications.  For  ex¬ 
ample,  Dr.  Nayar’s  basic  scientific  work  on  imaging 
sensors  with  non-traditional  optics  led  to  the  devel¬ 
opment  of  the  omnidirectional  sensor  which  has  clear 
application  in  surveillance  and  tracking. 

This  report  provides  short  summaries  of  our  signif¬ 
icant  contributions,  with  citations  to  related  papers. 
Length  of  presentation  herein  does  not  reflect  level  of 
effort  nor  our  view  of  its  significance  —  many  of  the 
most  important  areas  have  papers  elsewhere  in  these 
proceedings. 

1  Omnidirectional  Video  Cameras 

Conventional video cameras are often limited in their
fields of view; the image represents a small section of
the scene that lies in front of the camera. Our interest
is in the creation of video cameras that can “see” in all
directions. Our omnidirectional cameras are based
on the concept of catadioptric image formation: the
use of mirrors in addition to lenses to form an image
with an unusually large field of view, while maintaining
a fixed viewpoint [Nayar-97]. We have developed
several prototypes of omnidirectional cameras
that use parabolic mirrors. In addition, we have
solved the theoretical problem of deriving the entire
class of catadioptric sensors that yield a single
viewpoint [Nayar and Baker-97]. The fixed viewpoint
in our cameras enables the user to create perspective
(conventional) images of the scene from an
omnidirectional image for any user-selected viewing
direction and magnification [Nayar and Peri-97].
An interactive software system has been developed
that allows a user to generate about a dozen perspective
video ports from a single omnidirectional video
stream [Peri and Nayar-97] using no more than a PC.

The implications of omnidirectional video for vision
are numerous. We are presently developing a
small number of our cameras for use by others in the
DARPA IU community. The intended applications
lie within the area of video surveillance, where
omnidirectional cameras have the advantage that they do
not have much of a blind spot. More information can
be found in the PI report on a new project centered
on developing and applying omnidirectional video:
system refinements, higher-level visual processing
algorithms such as motion estimation, object tracking,
object stabilization, and super-resolution algorithms
that can be directly applied to omnidirectional video.
Ultimately, the use of two or more omnidirectional
cameras for full-view 3-D reconstruction of scenes
will be pursued.

2  Rational  Filters  for  Passive  Depth  from 
Defocus 

Last IUW we demonstrated a significant breakthrough
in range image acquisition: a defocus-based
system that produced 512x512 depth maps at 30 fps.
That system relies on active illumination, which
limits its use outdoors.

Our past successes have motivated us to develop a
general purpose sensor that captures, simultaneously,
two images of the scene taken at different focus
settings. This bifocal sensor comprises a telecentric
lens [Watanabe and Nayar-95], a beam-splitting
prism, and two CCD image sensors. The telecentric
lens ensures that the magnification of the sensor
is invariant to defocus. The two CCD sensors are
positioned to provide slightly different effective
focal lengths. As a result, two images of the scene are
sensed that correspond to different focus settings.

A  central  component  in  this  passive  depth  from  de¬ 
focus  system  is  a  new  class  of  broadband  rational 
operators  that,  when  used  together,  provide  invari¬ 
ance  to  scene  texture  and  produce  accurate  and  dense 
depth  maps  [Watanabe  and  Watanabe-96,  Watanabe 
and  Watanabe-98].  Since  the  operators  are  broad¬ 
band,  a  very  small  number  of  them  are  sufficient 
for  depth  estimation  of  scenes  with  complex  textural 
properties.  In  addition,  a  depth  confidence  measure 
has  been  derived  that  can  be  computed  from  the  out¬ 
puts  of  the  operators.  This  confidence  measure  per¬ 
mits  further  refinement  of  computed  depth  maps.  Ex¬ 
periments  were  conducted  on  both  synthetic  and  real 
scenes  to  evaluate  the  performance  of  the  proposed 
operators.  The  depth  detection  gain  error  is  less  than 
1%, irrespective of texture frequency. Depth accuracy
is found to be 0.5 to 1.0% of the distance of the
object from the imaging optics.

We  are  presently  developing  algorithms  that  use  the 
bifocal  sensor  for  real-time  depth  map  computation, 
real-time  depth-based  image  segmentation,  and  real¬ 
time  depth-based  image  compression. 

2.1  CAD  Model  Acquisition  from  Multiple 
Range  Images 

Some  aspects  of  our  project  are  more  closely  tied  to 
the  manufacturing  area;  CAD  model  acquisition  is 
one  of  them.  Direct  recovery  of  an  accurate  CAD 
model  from  an  object  would  also  be  beneficial  for 
battlefield  simulation  development  and  for  part  re¬ 
placement.  Furthermore  the  basic  issues  of  where- 
to-sense  and  how  to  combine  the  polygonal  data  in 
a  consistent  manner  transcend  the  CAD  arena. 

We have developed a new method for automatically
constructing a CAD model of an unknown object
from range images. The method is an incremental
one that interleaves a sensing operation that acquires
and merges information into the model with a
planning phase to determine the next sensor position or
“view” [Reed and Allen-97a, Reed and Allen-97b,
Reed et al.-97b, Reed et al.-97a]. This is accomplished
by integrating a system for 3-D model
acquisition with a sensor planner. The model acquisition
system provides facilities for range image acquisition,
solid model construction and model merging;
both mesh surface and solid representations are used
to build a model of the range data from each view,
which is then merged with the model built from
previous sensing operations. The planning system
utilizes the resulting incomplete model to plan the next
sensing operation by finding a sensor viewpoint that
will improve the fidelity of the model. Experimental
results can be found in these proceedings [Reed
et al.-97b] for complex parts that include polygonal
faces, curved surfaces, and large self-occlusions.

In  the  first  phase  of  our  method,  we  acquire  a  range 
image,  model  it  as  a  solid,  and  merge  it  with  any  pre¬ 
viously  acquired  information.  This  phase  motivates 
the  generation  of  a  topologically  correct  solid  model 
at  each  stage  of  the  modeling  process,  which  allows 
the  use  of  well-defined  geometric  algorithms  to  per¬ 
form  the  merging  task  and  additionally  supports  the 
view  planning  process.  The  second  phase  plans  the 
next  sensor  orientation  so  that  each  additional  sens¬ 
ing  operation  recovers  object  surface  that  has  not 
yet  been  modeled.  Using  this  planning  component 
makes  it  possible  to  reduce  the  number  of  sensing  op¬ 
erations  to  recover  a  model:  systems  without  plan¬ 
ning  typically  utilize  as  many  as  70  range  images, 
with  significant  overlap  between  them.  This  concept 
of  reducing  the  number  of  scans  is  important  for  tasks 
such  as  3-D  FAX  where  the  sensing  process  may  add 
considerable  time.  The  result  is  a  3-D  CAD  model  of 
the  object. 

For  each  range  scan,  a  mesh  surface  is  formed  and 
“swept”  to  create  a  solid  volume  model  of  both 
the  imaged  object  surfaces  and  the  occluded  vol¬ 
ume.  This  is  done  by  applying  an  extrusion  operator 
to  each  triangular  mesh  element,  sweeping  it  along 
the  vector  of  the  rangefinder’s  sensing  axis,  until  it 
comes  in  contact  with  a  far  bounding  plane.  The 
result  is  a  5-sided  triangular  prism.  A  regularized 
union  operation  is  applied  to  the  set  of  prisms  which 
produces  a  polyhedral  solid  consisting  of  three  sets 
of  surfaces:  a  mesh-like  surface  from  the  acquired 
range  data,  a  number  of  lateral  faces  equal  to  the 
number  of  vertices  on  the  boundary  of  the  mesh  de¬ 
rived  from  the  sweeping  operation,  and  a  bounding 
surface  that  caps  one  end.  Each  surface  is  tagged  as 
“seen”  or  “occlusion”  for  the  sensor  planning  phase 
that  follows. 
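
The extrusion of a single mesh element can be sketched as below. This is an illustrative simplification with hypothetical names: the far bounding plane is taken to be perpendicular to the sensing axis (the set of points x with axis · x = far), and the faces are returned as index tuples without the "seen"/"occlusion" tags or the regularized union that follows in the full method.

```python
import numpy as np

def extrude_triangle(tri, axis, far):
    """Sweep a triangle along `axis` until it meets the plane axis . x = far.
    Returns the 6 prism vertices and its 5 faces as vertex-index tuples."""
    tri = np.asarray(tri, dtype=float)
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    travel = far - tri @ axis          # distance each vertex must move
    if np.any(travel < 0):
        raise ValueError("triangle lies beyond the far bounding plane")
    swept = tri + travel[:, None] * axis
    vertices = np.vstack([tri, swept])
    faces = [
        (0, 1, 2),                     # imaged surface element
        (3, 4, 5),                     # cap on the far bounding plane
        (0, 1, 4, 3), (1, 2, 5, 4), (2, 0, 3, 5),  # lateral faces
    ]
    return vertices, faces

triangle = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, 1.5)]
vertices, faces = extrude_triangle(triangle, axis=(0.0, 0.0, 1.0), far=10.0)
```

Each prism has two triangular faces and three quadrilateral lateral faces, which is the 5-sided solid described above.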

Each  successive  sensing  operation  will  result  in  new 
information  that  must  be  merged  with  the  current 
composite  model.  The  merging  process  itself  starts 
by  initializing  the  composite  model  to  be  the  entire 
bounded space of our modeling system. The information
determined by a newly acquired model from
a single viewpoint is incorporated into the composite
model by performing a regularized set intersection
operation between the two. The intersection operation
must be able to correctly propagate the surface-type
tags from surfaces in the models through to the
composite model.
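
The merging logic can be illustrated on a voxel grid, where set intersection is a simple logical AND. This is a deliberately simplified stand-in: the method described above operates on polyhedral solid models with regularized set operations and propagates surface tags, none of which is modeled here. The grid size, object, and view solids are hypothetical.

```python
import numpy as np

def merge_views(shape, view_solids):
    """Intersect per-view solids, starting from the full bounded workspace."""
    composite = np.ones(shape, dtype=bool)
    for solid in view_solids:
        composite &= solid
    return composite

# Hypothetical 4x4x4 workspace containing a 2x2x2 object.
obj = np.zeros((4, 4, 4), dtype=bool)
obj[1:3, 1:3, 1:3] = True

# Each view's solid is the seen surface plus everything hidden behind it.
view_along_z = np.zeros_like(obj)
view_along_z[1:3, 1:3, 1:] = True
view_along_x = np.zeros_like(obj)
view_along_x[1:, 1:3, 1:3] = True

composite = merge_views(obj.shape, [view_along_z, view_along_x])
```

Each additional view can only remove volume from the composite, so the model tightens toward the object as occluded regions are observed from new directions.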

Critical  to  any  incremental  method  is  planning  the 
next  view  of  the  object  to  reduce  uncertainty.  What  is 
desired  is  a  method  that  takes  known  self-occlusions 
into  account,  and  yet  does  not  need  to  discretize  the 
sensing  positions  and  compute  an  image  for  each  of 
them.  Our  method  is  able  to  select  a  specific  target 
to  be  imaged  which  will  reduce  the  uncertainty  and 
is  also  able  to  avoid  self-occlusion  problems.  Once 
an  unoccluded  sensor  position  for  the  specified  sur¬ 
face  has  been  determined,  it  may  then  be  scanned, 
modeled,  and  integrated  with  the  composite  model. 
Our method is target-driven and performed in
continuous space. As the incremental modeling process
proceeds, regions that require additional sensing are
guaranteed an occlusion-free view from
the sensor if one exists. Other viewing constraints
may  also  be  included  in  the  sensor  planning  such  as 
sensor  field  of  view,  resolution,  and  standoff  distance 
of  the  sensor. 

2.2  Dynamic  Sensor  Planning 

We have extended our Machine Vision Planning
(MVP) system [Tarabanis et al.-94, Tarabanis et al.-95a,
Tarabanis et al.-95b], which automatically computes
viewpoints for monitoring objects and features
in a robot work-cell, to function in an environment
in which objects are moving [Abrams-97]. There are
three key elements of this research. The first component
is the computation of the volumes swept out
by moving objects over a time interval. This permits
the determination of occluded regions during that
time interval. We have developed a new algorithm
for swept volume computations, handling arbitrary
polyhedral objects moving through arbitrary trajectories.
We have also developed a new algorithm for the
computation of the camera positions, orientations,
and optical settings to be used during these time
intervals for robotic inspection and monitoring tasks.
These algorithms may have further application in
automated surveillance applications where camera
positions have to be chosen a priori. The conflicting
nature of the viewing constraints (e.g. resolution and
field-of-view tend to want different solutions) has led
us to develop a new method for viewpoint computation.
This method [Abrams et al.-96] effectively
decomposes the constraints of focus, resolution,
field-of-view, and visibility into satisfiable subsets that can
then be efficiently searched to reach an overall solution.
A five-degree-of-freedom Cartesian robot carrying
a CCD camera in a hand/eye configuration and
surrounding the work-cell of a Puma 560 robot has
been constructed for performing sensor planning
experiments. The results of these experiments,
demonstrating the use of this system in a robot work-cell,
are described in detail in [Abrams-97].

We  briefly  describe  one  experiment  here.  In  circuit 
board  assembly  inspection,  certain  locations  on  the 
IC  board  need  to  be  visually  monitored  at  all  times, 
even  though  possibly  occluded  by  a  moving  robot  in 
the  workcell.  Our  system  models  the  workcell  and 
any  motion  in  it,  creating  swept  3-D  volumes  for  any 
object in the workcell whose motion is known. Using
these swept objects, which can be thought of as
temporal occlusion objects, we can compute the
visibility volumes for each location that is to be
monitored. Optical constraints are then solved for and
merged  to  find  a  correct  viewpoint  if  one  exists.  If  a 
single  view  is  not  sufficient,  the  volumes  are  decom¬ 
posed  to  find  intermediate  viewpoints  that  suffice. 
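
The visibility test in this experiment can be illustrated with a small
sketch. It is not the MVP implementation: it assumes the moving object is
an axis-aligned box, approximates its swept volume by the union of the box
at sampled offsets along a translation, and checks line of sight with a
standard slab test. The names Box, swept_boxes, segment_hits_box, and
visible_viewpoints are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    lo: tuple  # minimum (x, y, z) corner
    hi: tuple  # maximum (x, y, z) corner


def swept_boxes(box, offsets):
    """Approximate a swept volume by the union of the box at sampled offsets."""
    return [
        Box(tuple(l + d for l, d in zip(box.lo, off)),
            tuple(h + d for h, d in zip(box.hi, off)))
        for off in offsets
    ]


def segment_hits_box(p, q, box):
    """Slab test: True if the segment from p to q intersects the box."""
    t_enter, t_exit = 0.0, 1.0
    for axis in range(3):
        d = q[axis] - p[axis]
        if abs(d) < 1e-12:
            # Segment is parallel to this slab: reject if it lies outside.
            if p[axis] < box.lo[axis] or p[axis] > box.hi[axis]:
                return False
            continue
        t_a = (box.lo[axis] - p[axis]) / d
        t_b = (box.hi[axis] - p[axis]) / d
        if t_a > t_b:
            t_a, t_b = t_b, t_a
        t_enter, t_exit = max(t_enter, t_a), min(t_exit, t_b)
        if t_enter > t_exit:
            return False
    return True


def visible_viewpoints(target, candidates, occluders):
    """Keep candidate camera positions with a clear line of sight to target."""
    return [v for v in candidates
            if not any(segment_hits_box(v, target, b) for b in occluders)]
```

Sampling the trajectory only approximates the swept volume; the offsets
must be spaced no wider than the object, or gaps between samples will be
reported as visible.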

3  Parametric  Feature  Detection 

In [Baker et al.-98a, Baker et al.-98b, Nayar et al.-96b]
we have developed an algorithm to automatically
construct a feature detector for an arbitrary
parametric  feature.  Our  work  has  resulted  in  a  gen¬ 
eral  framework  for  feature  detection;  it  permits  a 
user  to  input  a  parametric  feature  model  and  pro¬ 
vides  a  feature  detector  that  is  optimal  in  the  corre¬ 
lation  sense.  Besides  the  unparalleled  generality  of 
the  technique,  the  other  major  contribution  is  the  in¬ 
corporation  of  realistic  models  of  optical  and  sens¬ 
ing  effects,  which  we  argue  to  be  vital  in  order  to 
obtain  a  high  level  of  feature  detection  robustness. 
In  the  algorithm,  each  feature  is  represented  as  a 
densely  sampled  parametric  manifold  in  a  low  di¬ 
mensional  subspace  of  a  Hilbert  space.  Detection  is 
then  performed  by  projecting  the  brightness  distri¬ 
bution  around  each  image  pixel  into  the  subspace. 
If  the  projection  lies  sufficiently  close  to  the  fea¬ 
ture  manifold,  the  feature  is  detected  and  the  location 
of  the  closest  point  on  the  manifold  yields  the  fea¬ 
ture  parameters.  To  find  the  closest  manifold  point 
sufficiently  quickly,  we  employ  parameter  reduction 
by normalization, dimension reduction [Nayar et al.-96b],
pattern rejection [Baker and Nayar-96], and
heuristic search [Baker et al.-98b].

We  have  applied  the  algorithm  to  construct  detec¬ 
tors  for  5  parametric  features,  namely,  step  edge,  roof 
edge, line, corner, and circular disc. Using these 5
detectors,  we  have  conducted  detailed  experiments 
to  demonstrate  the  efficacy  of  the  technique  and  in 
particular  have  found  it  to  perform  comparably  to 
the edge detectors of Canny [Canny-86] and Nalwa-Binford
[Nalwa and Binford-86]. As future work, we
are  interested  in  the  analysis  of  weighting  functions 
that  optimize  the  performance  of  feature  detection. 
This  will  involve  a  theoretical  analysis  of  the  formu¬ 
lation  of  feature  detection  measures  as  well  as  the  nu¬ 
merical  optimization  of  weighting  functions. 

4  Deformable  models 

Deformable  models  are  used  for  tracking  and  object 
recognition.  We  are  investigating  inductive  learning 
of  deformable  models.  Such  models  find  the  shape 
that  best  describes  the  boundary  of  some  object  in 
an  image.  That  is,  given  an  image,  some  objective 
(“energy”)  function  is  optimized  over  a  family  of 
parametrized  shapes. 

In  cluttered  environments,  simple  gradient  strength 
at  the  shape  boundary  does  not  distinguish  the  de¬ 
sired  object  from  others.  Other  image  and  shape 
qualities  must  guide  segmentation,  and  training  on 
domain  data  determines  how.  Prior  work  with  train¬ 
ing  has  used  particular  features  besides  gradient 
strength,  but  none  has  examined  training  in  a  gen¬ 
eral  setting.  Our  work  establishes  a  methodology 
for  selecting  image  features  and  objective  functions 
learned  from  them,  and  for  measuring  their  effective¬ 
ness  in  order  to  find  the  best  way  to  optimize  a  param¬ 
eterized  shape  within  cluttered  imagery  in  a  domain. 

In  our  work,  the  input  to  the  training  is  a  large  set 
of  presegmented  images  in  which  the  sought  shape 
(“ground  truth”)  is  known.  A  probability  model  of 
the  chosen  image  features  describes  a  family  of  ob¬ 
jective  functions.  Given  the  model,  the  training  data 
determines  the  objective  function.  The  image  fea¬ 
tures  are  quantities  that  relate  the  shape  to  the  im¬ 
age,  and  include  the  usual  intensity  and  gradient  mea¬ 
sures,  but  they  are  potentially  unlimited  and  may 
even  be  non-local.  For  example,  one  such  non-local 
feature  could  measure  the  distance  or  orientation  to  a 
remote  but  reliable  landmark. 

We then evaluate the quality of the features-plus-function
combination. This is done statistically; we
generate  random  perturbations  of  the  ground  truth 
shape  in  presegmented  images,  and  then  examine  the 
relationship  between  the  amount  of  perturbation  and 
the  perturbed  shape’s  objective  function  value.  Ini¬ 
tial  results  are  on  medical  imagery  but  they  will  also 
be  applied  in  the  manufacturing  domain. 


5  Warping-based  Fusion  &  Super-Resolution

Given  an  image  sequence  from  one  or  more  sen¬ 
sors,  it  seems  natural  to  want  to  fuse  the  data  to 
produce  new  and  better  estimates.  Image  fusion  al¬ 
most  always  requires  warping  to  align  the  data  be¬ 
fore  the  fusion  step.  One  of  our  ongoing  projects 
is improved algorithms for this warping and various
algorithms for image fusion. These techniques have been
applied to warp 4 camera views to compute polarization
images [Zhou-96], and to develop super-resolution
images [Chiang and Boult-97b,
Chiang and Boult-97a, Chiang and Boult-96a].

There are, of course, some fundamental limits on
what this combination can do. If the images were
noise-free, focused, and Nyquist sampled, then multiple
images from a single viewpoint would add nothing.
In practice, however, things move, images are blurred,
and with the noise and aliasing present in images,
deblurring is unstable. If time is not a concern, then
existing techniques can address these problems by
formulating fusion as millions of coupled equations. For
industrial or battlefield use, the fusion needs to be fast:
a four-fold increase in resolution that takes 4 hours is
not likely to get used. Other researchers, e.g. [Huang
and Tsai-84, Irani and Peleg-91, Irani and Peleg-93,
Bascle et al.-96], have addressed this problem by making
various accuracy/time tradeoffs.

We  have  developed  a  technique  that  can  warp/fuse 
images  in  seconds  [Chiang  and  Boult-96a]  with  qual¬ 
ity  superior  to  the  best  previous  work.  Unlike  pre¬ 
vious  approaches  our  algorithm  is  direct,  not  iter¬ 
ative.  The  algorithm  performs  matching,  warping, 
and  fusion  followed  by  an  explicit  (optional)  de¬ 
blurring/sharpening  step.  We  showed  that  image 
warping  techniques  can  have  a  strong  impact  on  the 
overall  quality.  By  coupling  a  degradation  model 
of  the  imaging  system  directly  into  our  integrating 
resampler [Chiang and Boult-96b], we can better
approximate the warping characteristics of real sensors,
which in turn improves the quality of super-resolution
images. Quantitative evaluation is underway,
passing the resulting super-resolution images to
recognition  systems  and  comparing  recognition  per¬ 
formance.  Initial  results  support  our  claims  of  supe¬ 
riority. 

Until  now,  super-resolution  algorithms  required  the 
images  be  taken  under  the  same  illumination.  This  is 
acceptable  for  very  short  video-sequences  from  a  sin¬ 
gle  viewpoint  in  a  controlled  environment.  For  mil¬ 
itary  situations,  there  was  a  desire  to  obtain  super¬ 
resolution  images  taken  over  significant  time  spans 
and different viewpoints. When intensities change,
simple fusion cannot estimate them.

We  recently  demonstrated  a  new  approach  to  super¬ 
resolution  which  uses  edge  models,  local  blur  esti¬ 
mates  and  intensity  reconstruction  from  these  mod¬ 
els  to  circumvent  the  problems  of  lighting  variations 
[Chiang  and  Boult-97a,  Chiang  and  Boult-97b].  We 
fuse  estimates  of  the  sub-pixel  edge  and  blur  mod¬ 
els  and  then  reconstruct  the  desired  super-resolution 
intensity  image.  A  remaining  significant  problem 
is  the  required  sub-pixel  matching  which  is  more 
difficult  with  changing  illumination,  and  even  more 
so  with  viewpoint  changes.  We  are  currently  do¬ 
ing  a  quantitative  evaluation  of  both  of  our  super¬ 
resolution  algorithms  in  domains  where  the  match¬ 
ing  is  not  a  significant  problem.  This  edge/blur  based 
super-resolution  approach  represents  a  fundamental 
advance  for  warping-based  image  fusion. 

6  Region  plus  Wavelet-based  Fusion 

Not  all  fusion  can  be  accomplished  at  the  “pixel” 
level.  We  have  studied  a  new  fusion  concept  [Zhong 
and  Blum-97]  which  combines  the  traditional  pixel- 
level  fusion  with  feature-level  fusion.  Under  this 
concept,  we  developed  a  new  region-based  fusion  al¬ 
gorithm  in  which  the  wavelet  transform,  edge  detec¬ 
tion  and  image  segmentation  are  all  combined.  Ex¬ 
periments  [Zhong  and  Blum-97]  show  that  this  algo¬ 
rithm  works  well  in  many  situations.  The  same  ap¬ 
proach  works  equally  well  for  fusion  of  different  sen¬ 
sors,  the  same  sensor  with  different  parameters  and 
scenes  with  different  lighting.  One  demonstration, 
see  [Zhong  and  Blum-97]  in  these  proceedings,  uses 
a pair of visual and 100 GHz radiometric (i.e.
mm-wave radar) images. The visual image shows the
location and appearance of the people while the
radiometric image shows the existence of a gun. From
the  fused  image,  one  can  clearly  and  promptly  see 
who,  if  anyone,  concealed  a  gun  beneath  their  clothes 
[Currie et al.-95, Currie et al.-96, Zhong and Blum-97].

7  Fusing  Decisions 

Sensory  data  will  ultimately  be  used  to  make  some 
decision  on  the  state  of  some  device,  part,  or  pro¬ 
cess.  Given  that  mechanisms  exist  for  making  indi¬ 
vidual  decisions  based  on  the  measurements  of  each 
of  the  individual  sensors  acting  alone,  methods  for 
combining the individual decisions are clearly of
interest. These decision fusion methods [Blum et al.-97]
should account for, among other things, the
differences in the reliabilities of the decisions made by
each of the different sensors. In our research the
observations may be statistically dependent across
sensors, which is more realistic than past work that
assumed independence. In [Blum-95] we proposed and
analyzed an adaptive scheme for learning the appropriate
decision combining rule and proved that our combining
rules converge to an optimum solution (minimum
probability  of  error).  Numerical  tests  indicate  that 
convergence  typically  occurs  quickly. 
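
As a point of reference for what a reliability-weighted combining rule
looks like, the sketch below implements the classical log-likelihood-ratio
fusion rule for conditionally independent sensors with known detection and
false-alarm probabilities. It is a baseline only, not the adaptive scheme
of [Blum-95], and it does not handle the dependent observations studied
here. The function name and parameters are hypothetical.

```python
import math


def fuse_decisions(decisions, p_detect, p_false_alarm, prior_h1=0.5):
    """Fuse binary sensor decisions, weighting each by its reliability.

    decisions      -- iterable of 0/1 local decisions
    p_detect       -- per-sensor P(decide 1 | H1), strictly between 0 and 1
    p_false_alarm  -- per-sensor P(decide 1 | H0), strictly between 0 and 1
    Assumes the sensors are conditionally independent given the hypothesis.
    """
    llr = math.log(prior_h1 / (1.0 - prior_h1))
    for u, pd, pf in zip(decisions, p_detect, p_false_alarm):
        if u:
            llr += math.log(pd / pf)
        else:
            llr += math.log((1.0 - pd) / (1.0 - pf))
    return 1 if llr > 0.0 else 0
```

With one reliable sensor (0.95 detection, 0.05 false alarm) and two weak
ones (0.6, 0.4), the reliable sensor outvotes the other two, which a simple
majority vote would not allow.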

There  are  clear  advantages  to  tuning  the  individual 
decision  mechanisms  at  the  same  time  that  the  deci¬ 
sion  combining  scheme  is  being  adapted.  We  have 
developed  an  approach  [Deans  and  Blum-96,  Deans- 
96]  which  enables  the  application  of  existing  adap¬ 
tive  filtering  and  processing  schemes  and  tests  of 
this  approach  are  promising.  Initially  our  approach 
was  limited  to  a  specific  class  of  observation  mod¬ 
els  which  include  the  difficult  case  where  the  sig¬ 
nals  of  interest  are  weak  and  the  observations  are  cor¬ 
rupted  by  either  additive  Gaussian  or  non-Gaussian 
noise.  More  recently  we  have  formulated  an  ap¬ 
proach  which  can  be  applied  to  a  more  general  set  of 
problems  [Shen-97]. 

We  have  begun  developing  a  theory  for  the  optimum 
decision  rules  for  the  individual  decision  makers  in 
decision combining schemes with dependent
observations. In our recent research [Vikalo-97,
Vikalo and Blum-97b, Vikalo and Blum-97a] the
noise  was  modeled  as  a  mixture  of  Gaussian  distri¬ 
butions,  a  general  and  practical  model  for  impulsive 
noise.  A  criterion  of  Bayes  risk  is  adopted  for  cases 
with  fixed  fusion  rules.  The  optimum  sensor  tests  are 
shown  to  be  different  from  the  best  isolated  sensor 
tests  in  several  cases.  Further,  a  methodology  for  pre¬ 
dicting  the  form  of  the  optimum  sensor  tests  has  been 
developed,  see  [Vikalo  and  Blum-97b]. 

8  Closest  Point  Search  in  High  Dimensions 

Closest  point  search  is  an  important  component  of 
many  recognition  techniques  in  computational  vi¬ 
sion.  For  instance,  in  appearance  based  recognition 
[Murase  and  Nayar-95a],  positioning,  tracking,  and 
inspection [Nayar et al.-96a], the point in eigenspace
closest  to  a  novel  input  point  identifies  an  appear¬ 
ance  which  is  most  similar  to  the  appearance  that  the 
novel  point  represents.  Existing  closest  point  algo¬ 
rithms,  such  as  indexing  [Califano  and  Mohan-91], 
k-d  tree,  and  R-tree  [Guttman-84]  are  not  efficient 
in high dimensional spaces. As a result they perform
poorly when employed for appearance recognition,
since eigenspaces typically have more than 15
dimensions. 

We have developed [Nene and Nayar-96] a new algorithm
which substantially improves high dimensional
performance. In our technique, rather than searching
for the closest point, we search for a closest point
within distance ε from the novel point. This is
accomplished by partitioning the space with pairs of
hyperplanes, one pair for each dimension and each
hyperplane placed at distance ε from the novel point.
The region enclosed by these hyperplanes is a hypercube
of side 2ε, the points within which can be found efficiently
with the aid of a precomputed data structure. An
exhaustive search is then performed on this small
number of hypercube points to find the closest point.
In [Nene and Nayar-96], we examined the complexity
of this algorithm and showed that, for commonly
occurring distributions, the complexity is roughly
O(ndε), where n is the number of points and d the
number of dimensions. Thus, we trade exponential
behavior in the number of dimensions for linear
behavior in the number of points. Extensive benchmarks
demonstrate  that  our  algorithm  outperforms  compet¬ 
itive  techniques  by  a  large  margin  [Nene  and  Nayar- 
85]. 
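
The hypercube search can be sketched as follows. This is a simplified
illustration, not the data structure of [Nene and Nayar-96]: it assumes
one sorted coordinate list per dimension as the precomputed structure,
uses binary search to find each slab, and intersects the slabs with plain
sets. The class name EpsilonCubeIndex and its methods are hypothetical.

```python
import bisect
import math


class EpsilonCubeIndex:
    def __init__(self, points):
        self.points = [tuple(p) for p in points]
        dims = len(self.points[0])
        # Precompute, per dimension, the points sorted by that coordinate.
        self.sorted_dims = [
            sorted((p[d], i) for i, p in enumerate(self.points))
            for d in range(dims)
        ]
        self.keys = [[c for c, _ in col] for col in self.sorted_dims]

    def closest_within(self, query, eps):
        """Return the index of the closest point within eps, or None."""
        candidates = None
        for d, q in enumerate(query):
            # Points between the hyperplane pair at q - eps and q + eps.
            lo = bisect.bisect_left(self.keys[d], q - eps)
            hi = bisect.bisect_right(self.keys[d], q + eps)
            slab = {i for _, i in self.sorted_dims[d][lo:hi]}
            candidates = slab if candidates is None else candidates & slab
            if not candidates:
                return None
        # Exhaustive search over the few points left inside the hypercube.
        best = min(candidates, key=lambda i: math.dist(self.points[i], query))
        return best if math.dist(self.points[best], query) <= eps else None
```

The final distance check is needed because a point can lie inside the
hypercube yet farther than ε from the query (near a corner).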

Closest  point  search  is  a  generic  problem  and  the 
newly  developed  technique  could  also  be  applied  in 
image  databases,  pattern  analysis,  and  various  types 
of  simulation. 

9  Visual  Gestural  Interfaces 

Last IUW we reported initial work on visual gestural
interfaces. Since then the work has undergone
evaluation  and  extensions.  This  work  takes  as  input 
images  of  an  uninstrumented  hand  against  a  natural 
background,  uses  standard  image  processing  hard¬ 
ware  to  segment  the  hand  via  color  matching,  tracks 
the  hand,  and  at  critical  junctures,  analyzes  the  pose 
via a neural net [Kjeldsen and Kender-96b, Kjeldsen
and Kender-96a]. The gestural class of the pose,
and  the  positional  information  of  the  tracking,  can  be 
used  to  drive  a  standard  menu-based  user  interface: 
for  example,  selecting  a  menu,  pulling  it  down,  and 
“clicking”  on  an  item  by  gestures,  or  moving  and  re¬ 
sizing  the  windows  themselves. 

The  first  study  establishes  the  non-linearities  in¬ 
volved  in  displaying  the  tracked  cursor  position 
given  the  visual  input.  We  show  that  the  smoothing 
constraints  of  the  tracking  are  critically  dependent  on 
the  context  in  which  hand  motion  occurs.  Model¬ 
ing  the  cursor  as  a  physical  object  with  “mass”,  posi¬ 
tion,  and  velocity  meets  some  but  not  all  of  these  con¬ 
straints.  We  improved  the  system  by  detecting  vari¬ 
ous  contexts  and  dynamically  adjusting  the  smooth¬ 
ing  parameters  depending  on  apparent  user  intention. 
The “force” function, which transmits visual location
to cursor position via a sigmoidally varying “spring”
constant,  depends  on  current  and  prior  positions  and 
velocities. 
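
A minimal sketch of this kind of smoothing is given below. It models the
cursor as a damped mass pulled toward the tracked hand position by a
spring whose constant varies sigmoidally. As a simplifying assumption, the
spring constant here depends only on the current hand-to-cursor distance,
whereas the function described above also uses prior positions and
velocities. All names and parameter values are hypothetical.

```python
import math


def spring_constant(distance, k_min=2.0, k_max=60.0, midpoint=20.0,
                    steepness=0.3):
    """Sigmoidal spring: soft for small offsets (jitter), stiff for large."""
    s = 1.0 / (1.0 + math.exp(-steepness * (distance - midpoint)))
    return k_min + (k_max - k_min) * s


def step_cursor(cursor, velocity, hand, dt=1.0 / 30.0, mass=1.0,
                damping=12.0):
    """Advance the cursor one frame toward the observed hand position."""
    dx, dy = hand[0] - cursor[0], hand[1] - cursor[1]
    k = spring_constant(math.hypot(dx, dy))
    # Spring force toward the hand, minus viscous damping on the cursor.
    ax = (k * dx - damping * velocity[0]) / mass
    ay = (k * dy - damping * velocity[1]) / mass
    # Semi-implicit Euler: update velocity first, then position.
    vx, vy = velocity[0] + ax * dt, velocity[1] + ay * dt
    return (cursor[0] + vx * dt, cursor[1] + vy * dt), (vx, vy)
```

With these values a 2-pixel offset barely moves the cursor in one frame,
while a 100-pixel offset closes a much larger fraction of the gap, which is
the jitter-suppressing but responsive behavior the text describes.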

In  the  second  study,  we  compared  the  usability  of 
the  visual  interface  with  that  of  standard  pointing  and 
clicking  in  several  ways.  First,  we  studied  the  ac¬ 
curacy  and  repeatability  of  visual  tracking  in  an  al¬ 
ternating  target  task.  The  new  smoothing  algorithm 
effectively  damps  out  the  majority  of  the  jitter  that 
is  present  in  the  raw  hand  position  data,  but  never¬ 
theless  tracks  fast  movements  very  well.  Next,  we 
evaluated  object  selection  performance  directly,  by 
measuring  the  length  of  time  needed  to  select  a  target 
by  pointing  at  it,  and  compared  this  time  to  that  us¬ 
ing  a  mouse.  Selection  time  was  measured  from  the 
moment  the  space  key  was  pressed  until  the  cursor 
had  been  inside  the  target  continuously  for  0.5  sec¬ 
onds.  The  mean  selection  time  for  free-hand  point¬ 
ing  was  1.91  seconds;  the  mean  selection  time  using 
the mouse was 1.57 seconds. These times include the 0.5
seconds within the target. However, free-hand pointing
time drops rapidly with increasing target size, leveling
out at around 1.2 seconds for larger objects;
selection time with a mouse drops only to 1.3 seconds.

Lastly,  we  developed  a  predictive  model  for  the  ac¬ 
curacy  of  our  free-hand  pointing  according  to  Fitts’ 
Law.  The  model  accurately  captured  both  hand  data 
and  mouse  data  and  allows  us  to  predict  system  per¬ 
formance  as  a  function  of  tracking  rate  and  track¬ 
ing  accuracy.  The  model  shows  that  random  jitter  in 
the  cursor  position  and  the  lag  caused  by  the  slow 
tracking  rate  are  sufficient  to  cause  the  long  selection 
times for small objects. The model indicated that
accuracy was far more critical than speed; with very
little noise and at a tracking speed attainable with
off-the-shelf hardware in a few years, free-hand pointing
time can be expected to be approximately the same as for
a mouse, slightly better for large objects, and slightly
worse for small ones. Under ideal conditions (i.e. no
tracking  lag  at  all),  gesture  has  the  potential  to  be  sig¬ 
nificantly  faster  than  using  a  mouse  for  objects  of  all 
sizes. 
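
For readers unfamiliar with Fitts' Law, the sketch below evaluates its
common Shannon form, MT = a + b log2(D/W + 1), where D is the distance to
the target and W its width. The coefficients a and b must be fit to
measured data; the values used in any call are placeholders, not results
from our experiments, and the jitter and lag terms of our model are not
included.

```python
import math


def fitts_time(distance, width, a, b):
    """Predicted movement time (seconds) under the Shannon form of Fitts' Law.

    a -- intercept in seconds (device dependent, fit from data)
    b -- slope in seconds per bit (device dependent, fit from data)
    """
    index_of_difficulty = math.log2(distance / width + 1.0)  # in bits
    return a + b * index_of_difficulty
```

The form makes the dependence on target size explicit: for a fixed
distance, doubling the target width lowers the index of difficulty and
therefore the predicted selection time.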

9.1  Visual  Control  of  Grasping 

While  great  strides  have  been  made  in  robotic  hand 
design  and  a  number  of  working  dextrous  robotic 
hands  built,  the  reality  is  that  the  sensory  information 
required  for  dextrous  manipulation  lags  the  mechani¬ 
cal  capability  of  the  hands.  Accurate  and  high  band¬ 
width  force  and  position  information  for  a  multiple 
finger hand is still difficult to acquire robustly.
Vision can be an effective sensing modality for grasping
tasks and can serve as an external sensor to provide
control information for devices that lack internal
sensing or that would require extensive re-engineering
to provide contact and force sensing.
Using a vision system, a simple uninstrumented
gripper/hand can become a precision device capable of
position  and  possibly  even  force  control.  When  vi¬ 
sion  is  coupled  with  any  existing  internal  hand  sens¬ 
ing,  or  external  tactile  sensors,  it  can  provide  a  rich 
set  of  complementary  information  to  confirm  and 
quantify  internal  sensory  data,  as  well  as  monitoring 
a  task’s  progress  [Yoshimi  and  Allen-97]. 

We  have  implemented  a  set  of  real-time  vision  mod¬ 
ules  that  can  be  used  to  track  and  monitor  the  hand 
as  it  performs  a  task.  The  vision  system  is  used  to 
track  the  links  of  each  finger  of  the  hand  as  well  as 
monitor  contact  and  grasp  points  on  objects  in  the 
workspace.  We  have  also  added  tactile  sensors  to 
augment the capability of our robotic hand [Allen et
al.-97, Allen et al.-96a, Allen et al.-96b]. The tactile
sensors  cover  the  links  of  the  hand  as  well  as  the  pal¬ 
mar  surface.  The  tactile  sensor  system  can  be  used  to 
localize  contacts  on  the  surfaces  of  the  hand,  as  well 
as  determine  contact  forces.  The  tactile  pads  use  a 
capacitive  tactile  sensor  and  the  electronics  package 
is  mounted  on  the  robot  wrist  with  wiring  to  each  pad 
on  the  fingers  and  palm.  The  tactile  sensor  geometry 
on  each  finger  link  is  a  4x8  grid  with  each  capacitive 
cell  approximately  3  mm  by  3  mm  and  1  mm  spac¬ 
ing  between  tactile  elements  (tactels),  and  the  sen¬ 
sor  can  bend  to  the  curve  of  the  fingertip.  The  sen¬ 
sor  is  covered  with  a  compliant  elastomer  that  allows 
force  distributions.  The  robotic  hand  we  are  using  is 
the  Barrett  Hand,  which  is  a  three-fingered,  four  DOF 
hand  with  limited  sensing  capability.  It  has  a  limited 
amount  of  internal  strain  gauge  force  sensing  capa¬ 
bility  built  into  it,  and  the  tactile  and  vision  systems 
can  be  used  to  accurately  quantify  contact  forces  in 
conjunction  with  the  strain  gauge  system. 

A  number  of  experiments  have  been  performed  with 
this  system  to  characterize  the  ability  of  vision  and 
force/contact  sensors  to  support  grasping  tasks.  They 
include  integration  of  real-time  visual  trackers  in 
conjunction  with  internal  strain  gauge  sensing  to  cor¬ 
rectly  localize  and  compute  finger  forces,  determina¬ 
tion  of  contact  points  on  the  inner  and  outer  links  of 
a  finger  through  tactile  and  visual  sensing,  and  deter¬ 
mination  of  vertical  displacement  by  tactile  sensing 
for  a  grasping  task.  In  these  experiments,  the  vision 
system  reported  contact  points  that  were  within  2  mm 
of  the  actual  contact  points. 


10  Remote  control  of  sensors/actuators 

In  support  of  the  Laptop  Vision  System  we  have  de¬ 
veloped  a  set  of  software  drivers  allowing  remote 
control  of  actuators.  The  high-level  interface  is  web 
oriented  and  the  low  level  drivers  provide  sufficient 
intelligence  so  as  to  handle  reasonable  delays  in  the 
command  stream. 

The  first  layer  was  development  of  some  much- 
needed  software  to  generate  PWM  (pulse-width- 
modulated)  signals  directly  on  the  PC  parallel  port 
under  control  of  commands  received  on  the  PC  serial 
port.  Both  “locked  anti-phase”  and  the  more  diffi¬ 
cult  “sign-magnitude”  modes  of  PWM  generation  are 
supported  for  maximum  flexibility  in  driving  differ¬ 
ent  kinds  of  H-bridges.  First  order  prediction  of  the 
duty  cycle  with  respect  to  time  is  explicitly  supported 
so  that  busy  hosts  can  afford  to  be  a  little  late  in  send¬ 
ing  commands  on  the  serial  line,  yet  still  produce 
smooth  actuator  trajectories.  We  also  developed  a 
server  which  presents,  to  any  number  of  clients,  a 
high-level  TCP/IP  control  interface.  Clients  include 
two  GUI  interfaces  that  allow  one  to  manipulate  the 
direct-drive  motors  with  on-screen  slider  bars,  with 
radio-button  selections  and,  in  the  case  of  the  two- 
axis  motors  (such  as  the  SPM),  with  direct  2D  posi¬ 
tion  maps. 
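
The two PWM modes and the first-order prediction can be sketched as
follows. This illustrates the general technique only and is not our
driver code: it assumes a normalized motor command u in [-1, 1] and a
command that carries a rate of change so the driver can extrapolate when
the host is late. All names are hypothetical.

```python
def _clamp(x, lo, hi):
    return max(lo, min(hi, x))


def locked_antiphase(u):
    """One PWM line: 50% duty is stop, 100% full forward, 0% full reverse."""
    return 0.5 * (1.0 + _clamp(u, -1.0, 1.0))


def sign_magnitude(u):
    """Direction line plus PWM line: returns (forward, duty)."""
    u = _clamp(u, -1.0, 1.0)
    return (u >= 0.0, abs(u))


class PredictedCommand:
    """Last command received, with its rate of change per second."""

    def __init__(self, value, rate, timestamp):
        self.value = value
        self.rate = rate
        self.timestamp = timestamp

    def at(self, t):
        """First-order extrapolation of the command to time t, clamped."""
        return _clamp(self.value + self.rate * (t - self.timestamp),
                      -1.0, 1.0)
```

Between serial updates the driver would evaluate the extrapolated command
at the current time and convert it with one of the two mode functions, so
a late host command produces a small correction rather than a step in the
actuator trajectory.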

A  parallel  development  has  been  the  TCP/IP  video 
server,  intended  principally  to  allow  clients  to  cap¬ 
ture  and  manipulate  logmap  images  in  real  time,  but 
designed  in  a  modular  fashion  that  will  easily  ac¬ 
commodate  many  kinds  of  filter.  The  server  handles 
the  sequencing  and  synchronizing  of  image  capturing 
and  the  filtering  pipelines. 

As  a  demonstration  of  the  servers,  we  developed 
a  WWW-program  that  allows  users  on  the  WWW 
(http://www-robotics.eecs.lehigh.edu:8009/imp) to
point  our  video  camera  mounted  on  a  spherical  point¬ 
ing  motor.  The  default  filter  gives  a  logmap  view,  and 
images  can  be  passed  through  any  number  of  subse¬ 
quent  filters.  This  graphical  client  uses  calibration 
information  to  map  positions  on  the  current  image 
to the PWM values required to point to those positions.
Unlike past web-cams, this interface adjusts
the  view  to  what  the  user  specifies,  not  just  in  some 
user  specified  direction.  A  third  client  is  Richard 
Wallace’s  “Alice”  program  whose  Web  clients  can 
use  a  natural-language  interface  to  move  the  camera. 

11  Recovery  of  Textureless  Scenes 

It  is  a  widely  held  belief  in  computer  vision  that  tex¬ 
tureless  surfaces  cannot  be  recovered  using  passive 
measurement  techniques.  Well  known  methods  such 
as stereo and structure from motion need correspondences
between points in order to enable the recovery
of the scene. This is impossible in the absence of
scene  texture.  Monocular  techniques  such  as  depth 
from  focus  and  depth  from  defocus  also  require  that 
the  scene  be  strongly  textured.  A  partial  solution  to 
the  problem  is  to  use  shape  from  shading,  but  it  is 
limited  by  strong  assumptions  in  order  to  make  the 
problem  tractable. 

All  the  methods  described  above  try  to  recover  the 
structure of the scene by analyzing a set of
two-dimensional images. We formulate a novel approach to
the  problem  by  viewing  the  process  of  image  forma¬ 
tion  as  a  fully  three-dimensional  mapping  [Sundaram 
and  Nayar-97].  We  model  the  process  by  which  the 
lens  encodes  the  three  dimensional  volume  behind 
the  lens  (which  we  define  to  be  the  Monocular  Visual 
Space (MVS)) with the structural information of the
scene  in  front  of  the  lens.  By  analyzing  the  properties 
of  the  MVS,  we  have  determined  necessary  and  suf¬ 
ficient  conditions  to  recover  from  the  MVS  the  struc¬ 
tural  information  of  the  scene.  These  conditions  led 
to  a  simple  procedure  to  recover  textureless  scenes. 
We  have  demonstrated  experimentally  the  recovery 
of  three  classes  of  textureless  surfaces:  planes,  lines 
and  paraboloids.  The  conditions  and  the  methods 
that  we  have  proposed  for  scene  recovery  are  gen¬ 
eral  in  nature  and  are  applicable  to  all  scenes  and  are 
not  limited  to  textureless  scenes.  Textureless  scenes 
merely  represent  the  worst  case  scenario  for  recov¬ 
ery. 

12  Reflectance  and  Texture  of  Real-World 
Surfaces 

The  appearance  of  real-world  textured  surfaces  de¬ 
pends  on  view,  illumination  and  the  scale  at  which 
the texture is observed. At coarse scale, appearance
is characterized by the BRDF (bidirectional
reflectance distribution function). At fine scale,
appearance can be characterized by the BTF
(bidirectional texture function). The problem of
characterizing surface appearance in terms of the BRDF
and BTF has been addressed in [Dana et al.-97] and
has  resulted  in  three  publicly  available  databases: 
a  BTF  measurement  database  with  texture  images 
from  over  60  different  samples,  each  observed  with 
over  200  combinations  of  viewing  and  illumination 
directions,  a  BRDF  measurement  database  with  re¬ 
flectance  measurements  from  the  same  samples  and 
a  BRDF  model  parameter  database  with  parame¬ 
ters  obtained  by  fitting  the  measured  data  to  two  re¬ 
cent  BRDF  models.  Specifically,  the  measurements 
are fit to two existing analytical representations: the
Oren-Nayar model [Oren and Nayar-95, Nayar and
Oren-95] for surfaces with isotropic roughness and
the Koenderink et al. decomposition [Koenderink
et al.-96] for both anisotropic and isotropic
surfaces. These databases are publicly available at
www.cs.columbia.edu/CAVE/curet.
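
To indicate what fitting to the Oren-Nayar model involves, the sketch
below evaluates the widely used simplified (qualitative) form of that
model, in which the roughness parameter sigma is the standard deviation
of facet slope angles in radians. Whether the database fits use this
simplified form or the full model is not stated here, so the choice is an
assumption; the function name and argument names are hypothetical.

```python
import math


def oren_nayar_radiance(theta_i, theta_r, phi_diff, sigma, albedo=1.0,
                        irradiance=1.0):
    """Reflected radiance under the simplified Oren-Nayar model.

    theta_i, theta_r -- incidence and viewing polar angles (radians)
    phi_diff         -- azimuth difference between view and source (radians)
    sigma            -- surface roughness (radians); 0 gives Lambertian
    """
    s2 = sigma * sigma
    a = 1.0 - 0.5 * s2 / (s2 + 0.33)
    b = 0.45 * s2 / (s2 + 0.09)
    alpha = max(theta_i, theta_r)
    beta = min(theta_i, theta_r)
    rough_term = (b * max(0.0, math.cos(phi_diff))
                  * math.sin(alpha) * math.tan(beta))
    return (albedo / math.pi) * irradiance * math.cos(theta_i) * (a + rough_term)
```

A fit would search for the albedo and sigma that minimize the difference
between this prediction and the measured reflectance over all sampled
viewing and illumination directions.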

Exactly  how  well  the  BRDFs  of  real-world  surfaces 
fit  existing  models  has  remained  unknown  as  each 
model  is  typically  verified  using  a  small  number  (2  to 
6)  of  surfaces.  Our  BRDF  measurement  database  al¬ 
lows  us  to  evaluate  the  performance  of  existing  mod¬ 
els  as  well  as  future  models.  The  BRDF  parame¬ 
ter  database  is  a  concise  representation  of  the  mea¬ 
surements  that  can  be  directly  used  for  both  image 
analysis  and  image  synthesis.  The  BTF  database 
can  be  used  for  development  of  algorithms  for  3D 
texture,  i.e.  image  texture  due  to  surface  roughness. 
The  changing  appearance  of  3D  texture  with  view 
and  illumination  cannot  be  studied  using  existing  tex¬ 
ture  databases  which  have  few  images  (often  a  sin¬ 
gle  image)  for  each  sample.  This  work  represents 
the  first  comprehensive  investigation  of  a  large  vari¬ 
ety  of  real-world  surfaces.  Our  future  work  is  geared 
towards  the  recognition  and  synthesis  of  real-world 
textures  and  will  rely  heavily  on  the  use  of  the  three 
databases  we  have  created. 

13  Ordinal  Measures  for  Visual 
Correspondence 

Most  traditional  approaches  to  matching  visual  infor¬ 
mation  rely  on  correlation  based  measures.  In  con¬ 
trast,  ordinal  measures  are  based  on  relative  ordering 
of intensity values in an image region, called the rank
permutation [Bhat and Nayar-96, Zabih and Woodfill-94].
By using distances between permutations [Bhat
and  Nayar-96,  Gideon  and  Hollister-87],  an  entire 
class  of  ordinal  measures  can  be  arrived  at.  In  [Bhat 
and  Nayar-96],  we  showed  the  utility  of  these  mea¬ 
sures  for  pixel-by-pixel  stereo  correspondence  where 
the  chief  issues  were  robustness  in  the  presence  of 
depth  discontinuities,  occlusion,  and  noise.  We  also 
discussed  methods  for  efficient  computation  of  the 
measures  which  is  important  for  practical  applica¬ 
tion. 
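
As a rough illustration of the idea, the sketch below computes the rank permutation of two small windows and a normalized distance between the permutations. The function names and the choice of a sum-of-absolute-rank-differences distance are assumptions made for this example; they are not the specific measures defined in [Bhat and Nayar-96].

```python
def rank_permutation(window):
    """Return the rank (0 = darkest) of each pixel in a flattened window.

    Ties are broken by pixel position, an arbitrary choice for this sketch.
    """
    order = sorted(range(len(window)), key=lambda i: (window[i], i))
    ranks = [0] * len(window)
    for rank, pixel in enumerate(order):
        ranks[pixel] = rank
    return ranks


def rank_distance(w1, w2):
    """Sum of absolute rank differences, normalized to lie in [0, 1]."""
    r1, r2 = rank_permutation(w1), rank_permutation(w2)
    n = len(r1)
    max_dist = (n * n) // 2  # largest possible sum for n items
    return sum(abs(a - b) for a, b in zip(r1, r2)) / max_dist


left = [10, 30, 70, 20, 50, 90, 40, 60, 80]    # 3x3 window, row-major
right = [12, 33, 75, 21, 52, 99, 41, 66, 85]   # same ordering, new values
flipped = list(reversed(left))                 # spatially reversed window

print(rank_distance(left, right))    # windows with identical ordering
print(rank_distance(left, flipped))  # windows with a very different ordering
```

Only the ordering of the intensities enters the distance, so two windows whose values differ but whose ordering agrees are treated as a perfect match.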

In [Bhat et al.-91], we presented a method for motion
estimation using ordinal measures. While popular
measures like the sum-of-squared-differences (SSD)
and normalized correlation (NCC) rely on linearity
between corresponding intensity values, ordinal
measures only require them to be monotonically related,
so that rank permutations between corresponding
regions are preserved. This property turns out to
be very useful for motion estimation in, for instance,
tagged  magnetic  resonance  images.  We  studied  the 
imaging  equations  involved  in  two  methods  of  tag¬ 
ging  and  observed  temporal  monotonicity  in  inten¬ 
sity  under  certain  conditions  though  the  tags  them¬ 
selves  fade.  We  compared  our  method  to  SSD  and 
NCC on a rotating ring phantom image sequence.
We  have  investigated  computational  issues  and  pre¬ 
sented  experiments  that  demonstrate  the  use  of  or¬ 
dinal  measures  for  motion  estimation.  Our  future 
work  will  be  focused  on  the  use  of  ordinal  measures 
for  image  retrieval  from  large  databases.  We  expect 
the  measures  to  perform  more  robustly  than  existing 
ones  based  on  correlation. 
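
The monotonicity argument can be checked with a small numerical sketch. The intensity change used here (a scaled square root) is a hypothetical stand-in for a nonlinear but monotonic change between frames; it is not derived from the tagging equations studied in the paper.

```python
import math


def ranks(values):
    """Rank (0 = smallest) of each element; ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    out = [0] * len(values)
    for rank, index in enumerate(order):
        out[index] = rank
    return out


def ssd(a, b):
    """Sum of squared differences between two equally sized windows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


frame1 = [10.0, 30.0, 70.0, 20.0, 50.0, 90.0, 40.0, 60.0, 80.0]
# Hypothetical nonlinear, monotonically increasing intensity change.
frame2 = [25.0 * math.sqrt(v) for v in frame1]

print("SSD between corresponding windows:", ssd(frame1, frame2))
print("rank permutation preserved:", ranks(frame1) == ranks(frame2))
```

The SSD between the two corresponding windows is large even though they depict the same region, while the rank permutations are identical.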

14 IUE Development

We have continued to take an active role in IUE
development, working on system internals, education
and library development.

One of our IUE contributions is the modification of the
code-generation system and other system internals to
dramatically reduce the size of the compiled libraries:
while functionality and classes have been continually
added to the system, the size of the libraries has been
reduced by a factor of about 4.

We have been active in educating the community
about the IUE. Lehigh hosted, and Dr. Boult led,
IUE summer camps in both 1995 (6 weeks) and 1996
(2 weeks), training approximately 20 researchers in
IUE development. We also helped teach an IUE
summer training camp at INRIA for another 15
researchers, and organized 3 workshops/one-day
tutorials.

The final aspect of our IUE work concentrates on
porting SLAM into the IUE. SLAM (Software Library
for Appearance Matching) was presented at last year’s
IUW and has been demonstrated on object recognition
[Murase and Nayar-95b, Nayar et al.-95], visual
tracking [Nayar et al.-94] and other applications.
SLAM has been successfully tested on a database with
over 100 3D objects, many of which are difficult to
recognize using purely geometric approaches. We felt
this was an important package to bring into the IUE.
As SLAM was not developed with IUE integration in
mind, we felt it would also present a realistic
integration challenge.

Last fall we demonstrated an initial integration,
which was done by adding data conversion between
IUE data and the SLAM functions and providing
“wrappers”. However, this decreased performance
and increased memory usage. A systematic approach
was then adopted to truly integrate SLAM into the
IUE without altering the stand-alone SLAM code.
This highlighted a weak spot of the IUE: it did not
support raw memory sharing between vector, matrix
and image classes, or with other similar external data
blocks. We developed a general solution using a
reference-counted smart-pointer block-data class for the
IUE [Zhang and Boult-97]. This class allows data
to be shared between different classes, and should
make it easier for any other applications that want to
“share” their data with the IUE.
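
The idea of sharing one raw data block between several container classes can be sketched as follows. The IUE class itself is C++; this Python sketch only illustrates the concept, relying on Python's own reference handling to keep the buffer alive while any view uses it. All class and method names are hypothetical and do not reflect the interface in [Zhang and Boult-97].

```python
from array import array


class BlockData:
    """Hypothetical shared block: one raw buffer that several views reference."""

    def __init__(self, size):
        self.buffer = array("d", [0.0] * size)


class VectorView:
    """Presents the whole block as a 1-D vector."""

    def __init__(self, block):
        self.block = block  # holding the reference keeps the buffer alive

    def __getitem__(self, i):
        return self.block.buffer[i]

    def __setitem__(self, i, value):
        self.block.buffer[i] = value


class MatrixView:
    """Presents the same block as a row-major rows x cols matrix."""

    def __init__(self, block, rows, cols):
        if rows * cols != len(block.buffer):
            raise ValueError("shape does not match block size")
        self.block, self.rows, self.cols = block, rows, cols

    def get(self, r, c):
        return self.block.buffer[r * self.cols + c]

    def set(self, r, c, value):
        self.block.buffer[r * self.cols + c] = value


block = BlockData(6)
vec = VectorView(block)
mat = MatrixView(block, rows=2, cols=3)

mat.set(1, 2, 7.5)   # write through the matrix view...
print(vec[5])        # ...and read the same memory through the vector view
```

No data is copied or converted when moving between the two views, which is the property the wrapper-based integration lacked.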

Full  evaluation  of  the  ported  SLAM  is  still  in 
progress  to  quantify  the  performance  characteristics, 
e.g.  tolerance  to  small  features,  the  selection  of  pa¬ 
rameters  and  the  prediction  of  eigen-vector  selection 
on  the  classification  results.  We  will  also  be  explor¬ 
ing  re-implementation  of  important  components  us¬ 
ing  all  integer  mathematics,  and  development  of  a 
user interface for the IUE version of SLAM.

15  Low  Cost  Laptop  Vision 

One of our project goals is to integrate Commercial-Off-
The-Shelf (COTS) components into a portable, low cost
vision and image processing system. The Laptop Vision System
(LVS)  can  process  disparate  sources  of  sensory  in¬ 
formation,  e.g.  intensity,  3D  range,  infrared,  color 
and  polarized  data,  for  a  diverse  range  of  applica¬ 
tions  such  as  industrial  inspection,  target  tracking 
and  wide  field  surveillance. 

The LVS supports both Windows 95/NT and
Linux development. Linux supports IUE development
on the LVS and allows easy remote access/monitoring
and, in most cases, superior performance. Windows
provides a familiar interface and drivers for a wider
array of specialized devices. We are currently using
IBM Thinkpads for the LVS. One LVS is a Thinkpad
760ed, which has an integrated frame-grabber and
provides self-contained operation at 5-10 Hz. We also
have an LVS based on a 755cx with a PCMCIA
frame-grabber, which is slower and not quite as
accurate as the 760ed-based system. For problems
demanding higher performance we are mating the 760
laptop with a docking station containing a Matrox
Meteor PCI frame-grabber.^ The LVS is outfitted with
wireless PCMCIA networking for total mobility.
Though it costs more than a desktop system, the LVS
offers the attraction of being mobile for use on the
battlefield and on the factory floor.

^ Very inexpensive CMOS cameras, such as a QuickCam, are easily
adapted to the system, but the quality of their imaging is usually not
sufficient for industrial, surveillance or research needs.

The LVS interfaces either directly through its own
parallel port, or indirectly over the network, to
external hardware such as a rotary table, the Active
Eye/SPM-based  pointing  system  and  various  other 
actuators. We are now evaluating the added benefit
of the new MMX instruction set. The initial evaluations
show that the CPU cycles of common image
processing operations can be reduced by a considerable
amount. For example, using optimized MMX
code (from Intel Corporation) can speed up 3x3 median
filtering by a factor of 4. When data is restricted
to 16-bit integers, matrix transpose (up to a 1024x1024
matrix) is about twice as fast, and matrix and vector
multiplication (up to 2.5M elements) is 10 times
faster.  We  expect  this  to  be  important  for  our  applica¬ 
tions  as  the  key  step  in  the  SLAM  software  is  exactly 
a  matrix  vector  product. 
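
To make the role of the matrix-vector product concrete, the following sketch projects an image vector onto a small set of basis vectors using integer arithmetic and then finds the closest stored projection. The basis vectors, mean image and object labels are made-up toy values, and the function names are hypothetical; this is not the SLAM code or its interface.

```python
def project(eigenvectors, image, mean):
    """Matrix-vector product: project the mean-subtracted image onto each row."""
    centered = [p - m for p, m in zip(image, mean)]
    return [sum(e * c for e, c in zip(row, centered)) for row in eigenvectors]


def nearest(point, stored):
    """Label of the stored projection closest (squared distance) to point."""
    return min(
        stored,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(point, stored[label])),
    )


# Toy 4-pixel "images" and two integer basis vectors (illustrative values only).
eigenvectors = [[1, 1, -1, -1],
                [1, -1, 1, -1]]
mean = [100, 100, 100, 100]

stored = {
    "object A": project(eigenvectors, [150, 150, 50, 50], mean),
    "object B": project(eigenvectors, [150, 50, 150, 50], mean),
}

query = project(eigenvectors, [140, 145, 60, 55], mean)
print(query, nearest(query, stored))
```

Because every input image requires one such product, a tenfold speedup of integer matrix-vector multiplication translates almost directly into matching throughput.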

We  have  built  a  portable  environment  for  near  real 
time  vision  and  image  processing  using  low  cost 
components. The development of vision and image
processing algorithms is split between IUE development
and small stand-alone “windows” programs
encapsulating already developed ideas. We have
looked into the possibility of achieving real time image
processing using optimized software, and identified
an application (SLAM) for evaluating various
issues, since SLAM itself is applicable to disparate
sensor sources; it is also generic and powerful
enough for solving a diverse range of vision problems
in real time with its simple and efficient matching
technique.

Summary 

This multi-disciplinary project has produced results
in advanced sensor development, sensor planning,
sensor fusion, vision module development,
vision/robotic interfaces and vision software systems.
We have produced significant results in these areas
and in so doing advanced the state of the art of visual
sensor systems for manufacturing and for battlefield
awareness. Our future goals are to continue to build
on the scientific foundation we have laid in the past
two years, to evaluate the component technologies we
have developed and to transition more of our work
into manufacturing practice.
Abstract^

The  Carnegie  Mellon  University  MURI  project 
sponsored  by  ONR  performs  multi-disciplinary 
research  in  integrating  vision  algorithms  with  sens¬ 
ing  technology  for  low-power,  low-latency,  com¬ 
pact  adaptive  vision  systems.  These  are  crucial 
features  necessary  for  augmenting  the  human  sen¬ 
sory  system  and  enabling  sensory  driven  informa¬ 
tion  delivery.  The  project  spans  four  sub-areas 
ranging from low-level to high-level vision: (1) smart
filters,  based  on  the  Acousto-Optic  Tunable  Filter 
(AOTF)  technology;  (2)  computational  sensor 
methodology,  which  integrates  raw  sensing  and 
computation  by  means  of  VLSI  technology;  (3) 
neural-network  based  saliency  identification  tech¬ 
niques  for  identifying  the  most  useful  information 
for  extraction  and  display;  and  (4)  visual  learning 
methods  for  automatic  signal-to-symbol  mapping. 

1. Introduction

Automated vision and sensing research has made
great strides in the last 30 years. Yet vision systems
still lack attributes shared by most successful
mass-market technologies: small size, low cost, low
power and highly reliable performance. If computer
vision processing had these characteristics,
the potential applications would be nearly endless.
Examples include: wearable smart vision systems
for enhancing a soldier’s situational awareness on the
battlefield; head-up display vision enhancement
systems for driving in bad weather and low visibility
conditions; head-up display field telemedicine
systems; and others. All these applications share
common features: they are mobile
and interact with the human sensory system. While
today these scenarios are mostly futuristic speculations,
some of the technologies they require have
been partially demonstrated. Our research further
develops these emerging technologies, and brings
these visions closer to reality.

1. This research has been sponsored in full or in part by Office
[...]ing the official policies, either expressed or implied, of ONR or
the U.S. Government.

The CMU MURI project performs multi-disciplinary
research spanning all levels of vision and sensing:
dynamically tunable acousto-optic multispectral
imaging [Brajovic and Kanade, 1997]; VLSI-based
computational sensors [Brajovic and Kanade, 1997];
neural network saliency detection [Pomerleau, 1997];
automatic visual acquisition of object models [Hebert
et  al.,  1997];  domain-independent  evolution-based 
learning  for  signal-to-symbol  mapping  [Glickman 
and  Sycara,  1997];  and  learning  coordination 
among  multiple  signal-to-symbol  mapping 
agents  [Teller  and  Veloso,  1997b].  We  believe  that 
the  tight  integration  of  vision  algorithms  and  sens¬ 
ing  technology  will  result  in  low-power,  low- 
latency,  compact,  adaptive  vision  systems  crucial 
for  effective  human  sensory  augmentation. 

1.1.  CMU  Approach 

The  separation  of  sensing  and  processing,  as  a  nat¬ 
ural  consequence  of  a  conventional  vision  system 
comprising  a  camera  and  computer,  results  in  sev¬ 
eral  deficiencies.  The  two  most  critical  features 
missing in this sense-and-process paradigm are low
latency  processing  and  sensory  adaptation. 
Latency, or reaction time, is the time that a system
takes to react to an event. The primary sources of
latency in vision systems are the data transfer
bottleneck, caused by the need to transfer an image
from the camera to the processor, and the computational
load bottleneck, caused by the processor’s
inability to quickly handle a large amount of visual
data. The detrimental effects of both bottlenecks
scale up with the image size. Often, the system
“receives”  the  image  data  too  late  to  cope  with  fast 
events  or  to  provide  sensory  feedback  to  a  human 
user.  For  example,  during  the  frame  time  of  a  con¬ 
ventional  camera,  a  person’s  gaze  direction  can 
shift by 18 degrees. In head-mounted display
applications, delays must be less than 10 to 20 msec
for the viewer to feel comfortable and natural.


Another  aspect  presently  missing  in  machine 
vision  is  top-down  sensory  adaptation.  Complex 
ad-hoc  algorithms  that  try  to  extract  relevant  infor¬ 
mation  from  inadequate  sensor  data  are  inevitably 
unreliable.  In  fact,  time  and  time  again  it  has  been 
observed  that  using  the  most  appropriate  sensing 
modality  or  setup  allows  recognition  algorithms  to 
be  far  simpler  and  more  reliable.  For  example,  the 
concept  of  active  vision  proposes  to  control  the 
geometric  parameters  of  the  camera  (e.g.,  pan,  tilt, 
etc.)  to  improve  the  reliability  of  the 
perception  [Aloimonos,  1992].  It  has  been  shown 
that  initially  ill-posed  problems  can  be  solved  after 
the  top-down  adaptation  of  the  camera’s  pose  has 
acquired  new,  more  appropriate  image  data.  How¬ 
ever,  adjusting  geometric  parameters  is  only  one 
level  at  which  adaptation  can  take  place.  Another 
example of adaptation is multi-spectral imaging,
which  can  eliminate  confusion  by  providing  sensor 
images  appropriate  for  the  task.  Acquisition  of 
appropriate  sensor  bands  adaptively,  however,  is 
often  difficult  since  most  multi-spectral  imaging 
devices  have  fixed  spectral  sensitivity,  while  the 
appropriate  wavelengths  to  process  vary  as  condi¬ 
tions  and  the  task  change.  Therefore,  a  system  that 
can  adjust  its  operation  at  all  levels,  even  down  to 
the  point  of  sensing,  would  be  far  more  adaptive 
than  one  that  tries  to  cope  with  the  variations  at  the 
“algorithmic”  or  “motoric”  level  alone. 

The  two  major  shortcomings  of  the  sense-and-pro- 
cess  approach  which  are  outlined  above,  along 
with  the  fact  that  this  approach  naturally  leads  to 
bulkier and less cost-effective systems, suggest that
an alternative is needed. We are establishing a new
paradigm  in  which  sensing  and  vision  processing 
are  tightly  coupled  for  fast,  time-critical,  adaptive 
operation. 

The  following  sections  describe  basic  techniques 
and  technologies  that  the  CMU  team  has  worked 
on;  we  believe  these  are  necessary  for  the  success 
of  a  low-latency  adaptive  vision  system  for  human 
sensory  augmentation. 

2.  Multi-Spectral  Imaging  Filters 


Contributors: L.J. Denes, M. Gottlieb, B. Kaminsky,
P. Metes, Z.K. Kun, M. Capizzi, J. Hibner, D.
Purta, A.M. Guzman


This  program  task  incorporates  the  spectral  (color) 
dimension  into  the  visual  reasoning  process.  A  pro¬ 
grammable  optical  filter  is  utilized  at  the  system’s 
front  end  to  reduce  the  computational  load  and  its 
resulting  bottlenecks  in  future  automated  vision 
systems.  Filtering  the  incoming  scene  according  to 
its  spectral  composition  can  remove  a  large  amount 
of  undesirable  background  clutter  prior  to  higher 
level  processing.  Figure  1  is  a  schematic  represen¬ 
tation  of  the  process.  Enhanced  performance  is 
anticipated  in  a  variety  of  applications,  including 
human  sensory  augmentation  systems  for  driver 
assistance.  Because  of  its  ability  to  extract  and 
track  objects,  this  vision  system  will  more  closely 
mimic  the  human  observer. 


Figure  1:  Object  recognition  using  color 
discrimination. 


We  have  assembled  a  multi-spectral  imaging  sys¬ 
tem  operating  in  the  visible  to  near  IR  range  utiliz¬ 
ing  an  existing  acousto-optic  tunable  filter  (AOTF). 
This  configuration  has  been  characterized,  yielding 
design  optimization  information.  Critical  data 
include spatial and spectral resolution, out-of-band
rejection,  efficiency,  field  of  view,  and  bandwidth. 
The  design  goal  is  efficient  operation  over  nearly 
two  octaves  of  wavelength,  and  superior  image 
quality.  Two  major  issues  were  successfully 
addressed.  The  first  relates  to  the  method  of  apply¬ 
ing  the  multiple  electrical  RF  control  signals  to  the 
AOTF  transducer  to  fully  exploit  the  multispectral 
capability.  Several  approaches  were  analyzed, 
including  multiple  oscillators,  spread  spectrum 
techniques,  and  the  use  of  an  arbitrary  waveform 
generator.  Recent  work  has  confirmed  that  the  arbi¬ 
trary  waveform  generator  provides  all  of  the  flexi¬ 
bility  required  with  no  serious  disadvantages.  In 
addition,  it  is  readily  adaptable  to  computer  con¬ 
trol.  The  second  issue  addressed  is  how  to  best 
achieve  object  identification  using  color  signature 
information.  A  fundamental  issue  arises  because 
any background object with a broadband color
distribution, e.g., white, will include the desired
signature within its spectrum. Thus, these background
objects cannot be discriminated from the target
object. To address this problem, we developed a
processing  technique  using  two  video  frames,  in 
which  the  first  frame  grab  contains  a  multispectral 
image  whose  spectral  content  lies  outside  the  target 
color  signature.  This  frame  is  inverted  and  then 
used  as  a  spatial  mask  over  the  entire  scene.  The 
second  frame  grab  includes  only  the  target  color 
signature  and  provides  us  with  a  gray  scale.  By 
using  an  appropriate  threshold,  the  target  image 
alone  is  displayed  against  a  black  background. 
Tests  of  laboratory  scenes  give  encouragingly  good 
results. 
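
The two-frame technique can be sketched on small integer images as follows. The function name, the 8-bit value range, and in particular the choice to binarize the inverted first frame with its own threshold are assumptions made for illustration; the text does not specify how the mask is formed.

```python
def isolate_target(out_of_band, in_band, mask_threshold, target_threshold,
                   max_value=255):
    """Two-frame target extraction on images given as lists of rows.

    out_of_band: frame whose spectral content lies outside the target signature
    in_band:     frame containing only the target color signature
    """
    result = []
    for row_out, row_in in zip(out_of_band, in_band):
        out_row = []
        for o, t in zip(row_out, row_in):
            inverted = max_value - o                       # invert first frame
            mask = 1 if inverted >= mask_threshold else 0  # assumed binary mask
            masked = t * mask                              # apply spatial mask
            out_row.append(masked if masked >= target_threshold else 0)
        result.append(out_row)
    return result


# One row, three pixels: white background, target, dark background.
out_of_band = [[200, 10, 20]]   # white is bright outside the target band
in_band = [[200, 180, 15]]      # white also contains the target signature

print(isolate_target(out_of_band, in_band,
                     mask_threshold=128, target_threshold=100))
```

The white pixel is bright in both frames, so the inverted first frame masks it out; only the pixel that is bright in the target band alone survives the final threshold.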

3.  Computational  Sensors  for  Low-Latency 
Adaptive  Vision 

Contributors:  Vladimir  Brajovic  and  Takeo 
Kanade 

The  computational  sensor  paradigm  [Kanade  and 
Bajcsy,  1993]  has  the  potential  to  greatly  reduce 
latency  and  provide  top-down  sensory  adaptation 
to  the  vision  system.  By  integrating  sensing  and 
processing  on  a  VLSI  chip,  both  transfer  and  com¬ 
putational  bottlenecks  can  be  alleviated:  on-chip 
routing  provides  high  throughput  transfer;  an  on- 
chip  processor  could  implement  massively-parallel 
fine-grain  computation  providing  high  processing 
capacity which readily scales up with the image
size. In addition, the tight coupling between processor
and sensor allows for efficient top-down
feedback  that  can  control  and  adjust  the  sensor  for 
further  acquisition  based  on  the  preliminary  results 
of  the  processing. 

Our  recent  work  has  been  concerned  with  efficient 
implementation  of  global  operations  over  large 
groups  of  image  data  using  a  computational  sensor 
paradigm  [Brajovic  and  Kanade,  1994].  Global 
operations  are  important  because:  (1)  in  percep¬ 
tion,  each  decision  is  a  kind  of  global,  or  overall, 
conclusion  necessary  for  the  coherent  interaction 
with  the  environment,  and  (2)  unlike  local  opera¬ 
tions  (e.g.,  filtering)  which  produce  large  amounts 
of  preprocessed  image  data,  global  operations  pro¬ 
duce  a  few  quantities  for  the  description  of  the 
environment  which  can  be  quickly  transferred  and/ 
or  processed  to  produce  an  appropriate  action  for  a 
machine.  The  main  difficulty  with  implementing 
global  operations  comes  from  the  necessity  to 
bring  together  all  or  most  of  the  data  in  the  input 
data  set.  We  have  formulated  two  mechanisms  for 
implementing  global  operations  in  computational 
sensors:  (1)  sensory  attention  [Brajovic  and 
Kanade,  1997],  and  (2)  intensity-to-time  process¬ 
ing  paradigm  [Brajovic  and  Kanade,  1996]. 

The  sensory  attention  is  based  on  the  premise  that 
salient  features  within  the  retinal  image  represent 
important  global  features  of  the  entire  image.  This 
premise is attractive for two reasons. First, the
main argument that has been used to explain the
need for selective visual attention in brains is that
the visual system has processing and communication
limitations; the same kinds of limitations exist in
machines. Attention “funnels” only relevant
information and protects the limited communication
and processing resources from
information overload. Indeed, the importance of
selecting  the  relevant  information  from  an  image  is 
now  widely  acknowledged  in  machine  vision; 
some  forms  of  attention  mechanisms  (e.g.  selecting 
a  correctly  sized  window  within  the  image)  are 
often  employed  in  practical  applications.  Second,  it 
has  been  shown  that  the  visual  attention  improves 
performance,  and  is  needed  for  maintaining  coher¬ 
ent  behavior  while  interacting  with  the  environ¬ 
ment  (i.e.  attention-for-action)  [Allport,  1989]. 
Location  of  such  attention  must  be  maintained  in 
the  environmental  coordinates,  thus  maintaining 
coherence  under  ocular  and  head 
motion [Milanese, 1993]. Unlike eye movement
(i.e.,  overt  shifts),  the  attention  shifts  (i.e.  covert 
shifts)  do  not  require  any  motor  action,  but  occur 
internally  on  a  fixed  retinal  image.  For  this  reason, 
attention  shifts  are  faster  and  play  an  important  role 
in  low-latency  vision  systems. 

We  have  implemented  sensory  attention  by  fabri¬ 
cating  and  testing  a  tracking  computational  sensor. 
This  track  sensor  optically  receives  a  saliency  map 
and  continuously  selects  and  tracks  the  peaks  in  the 
map. The location and intensity of the selected
saliency peaks are reported on a few output pins with
low latency. These quantities are also used internally
in a top-down fashion to aid tracking of the
attended location. The chip is a 28 x 28 array of
60µ x 60µ cells, and is fabricated on a 2.2mm x
2.2mm die. When tracking bright, well-defined
features, the sensor tracks targets moving across the
retina  at  about  6900  cells/second. 
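
A software analogue of this behavior is sketched below: the strongest cell of a saliency map is selected, and on later frames the search is restricted to a window around the previously attended location. The windowing rule, its radius, and the function name are assumptions chosen to illustrate top-down feedback; the chip's actual circuit mechanism is not described here.

```python
def track_peak(saliency, previous=None, radius=1):
    """Return (row, col, value) of the selected saliency peak.

    If `previous` (row, col) is given, only cells within `radius` of it
    are considered, a stand-in for top-down feedback from the last result.
    """
    rows, cols = len(saliency), len(saliency[0])
    if previous is None:
        r_range, c_range = range(rows), range(cols)
    else:
        pr, pc = previous
        r_range = range(max(0, pr - radius), min(rows, pr + radius + 1))
        c_range = range(max(0, pc - radius), min(cols, pc + radius + 1))
    value, r, c = max((saliency[r][c], r, c) for r in r_range for c in c_range)
    return r, c, value


frame1 = [[0, 0, 0, 0],
          [0, 9, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]
frame2 = [[0, 0, 0, 0],
          [0, 1, 8, 0],     # the attended peak moved one cell to the right
          [0, 0, 0, 0],
          [0, 0, 0, 10]]    # a brighter distractor appears far away

r, c, v = track_peak(frame1)
print(r, c, v)
print(track_peak(frame2, previous=(r, c)))
```

With the feedback window in place, the tracker follows the attended peak rather than jumping to the brighter distractor.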

The  intensity-to-time  processing  paradigm  is 
based  on  the  notion  that  stronger  signals  elicit 
responses  before  weaker  ones,  thus  allowing  a  glo¬ 
bal  processor  to  make  decisions  based  on  only  a 
few  inputs  at  a  time.  The  key  is  that  some  prelimi¬ 
nary  decisions  about  the  retinal  image  can  be  made 
as  soon  as  the  first  responses  are  received.  The 
intensity-to-time  processing  paradigm  is  used  for 
the  VLSI  implementation  of  a  sorting  computa¬ 
tional  sensor  —  a  sensor  that  sorts  input  stimuli  by 
their  intensity  as  they  are  being  sensed.  The  chip 
detects  an  image  focused  thereon  and  computes  an 
image  of  indices.  During  the  computation,  the  chip 
computes  a  cumulative  histogram  —  one  global 
quantity  of  the  detected  image  —  and  reports  it 
with low latency on one of the pins before the
image  is  ever  read  out.  The  cumulative  histogram 
is  used  internally  in  a  top-down  fashion  to  generate 
indices  within  each  pixel.  The  image  of  indices  has 
a  uniform  histogram  which  has  several  important 
properties:  (1)  the  contrast  is  maximally  enhanced, 
(2)  the  available  dynamic  range  of  readout  circuitry 
is  equally  utilized,  i.e.,  the  values  read  out  from  the 
chip  use  available  bits  most  efficiently,  and  (3)  the 
image  of  indices  never  saturates,  and  preserves  the 
same  range  (e.g.,  from  1  to  N)  under  varying  con¬ 
ditions  in  the  environment. 
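
The effect of the sorting sensor can be imitated in software as a sketch. Pixels are visited in order of decreasing intensity, mimicking stronger inputs responding first, and each pixel takes its index from the running count of responses so far. The mapping that gives the brightest pixel index N, and the tie-breaking by position, are assumptions of this example rather than details of the chip.

```python
def image_of_indices(pixels):
    """Return the index (1..N) of each pixel of a flattened image.

    The running count `responded` plays the role of the cumulative
    histogram at the moment a pixel responds.
    """
    n = len(pixels)
    # Response order: strongest input first; ties broken by position.
    order = sorted(range(n), key=lambda i: (-pixels[i], i))
    indices = [0] * n
    for responded, pixel in enumerate(order, start=1):
        indices[pixel] = n - responded + 1   # brightest pixel gets index n
    return indices


dim_scene = [3, 1, 2, 0]
bright_scene = [300, 100, 200, 0]   # same scene under stronger illumination

print(image_of_indices(dim_scene))
print(image_of_indices(bright_scene))
```

Both scenes yield the same image of indices, with every index from 1 to N used exactly once, which is the uniform-histogram and fixed-range property listed above.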

The adaptation of the dynamic range of the sorting
sensor is illustrated in Figure 2, which shows a sequence
of 93 images provided by the sorting sensor. By
observing the wall in the background, we can see
the effects of adaptive dynamic range: even though
the physical wall does not change in brightness, it
appears dimmer in those frames in which the bright
levels are taken by pixels which are physically
brighter (e.g., the subject’s face and arm). When the
subject turns and fills the field of view with dark
objects (e.g., hair), the wall appears brighter since it
is now taking higher indices. Also, note that the
maximum contrast is maintained in all the images
since all images of indices have a uniform histogram.


Figure  2:  Sequence  of  images  of  indices  computed 
by  the  sorting  sensors. 


We  continue  to  work  on  an  improved  sorting  com¬ 
putational  sensor  with  smaller  pixels  and  a  larger 
array.  We  also  continue  to  work  on  developing  new 
sensors  based  on  the  intensity-to-time  processing 
paradigm.  We  have  designed,  and  recently  received 
a  prototype  of,  a  self-contained  eye  tracking  sensor. 
We  plan  to  test  the  sensor  and  apply  it  in  several 
scenarios.  In  the  near  term,  we  will  begin  interfac¬ 
ing  some  of  our  computational  sensors  with  smart 
AOTF  filters. 

4.  Visibility  Estimation  from  a  Moving 
Vehicle 

Contributor:  Dean  Pomerleau 

Reduced  visibility  caused  by  fog,  rain,  snow,  dark¬ 
ness  and  glare  is  a  frequent  contributing  factor  to 
traffic  accidents  [Najm  et  al.,  1995].  In  fact,  some 
of the most serious of all highway incidents, sometimes
involving dozens or even hundreds of vehicles,
occur when reduced visibility conditions
result  in  a  chain  reaction  of  crashes.  Technologies 
typically  employed  to  estimate  visibility  include: 
transmissometers,  which  measure  the  transmit¬ 
tance  of  the  atmosphere  over  a  baseline  distance; 
and  nephelometers,  which  measure  the  scattering 
coefficient  caused  by  suspended  particles  in  an  air 
sample  [National  Weather  Service  1996].  Unfortu¬ 
nately,  these  systems  suffer  from  several  draw¬ 
backs  as  they  are  not  always  estimating  visibility 
from  the  driver’s  point  of  view.  The  only  way  to 
automatically  estimate  the  cumulative  influence  of 
these  factors  on  the  driver’s  ability  to  see  potential 
obstacles  ahead  is  to  employ  a  sensing  system 
which  reasonably  matches  the  driver’s  perceptual 
characteristics.  We  developed  a  system  that  accom¬ 
plishes  this  match  by  using  a  CCD  video  camera 
pointing  out  the  windshield  of  the  vehicle,  and  pro¬ 
cessing  the  same  features  processed  by  the  human 
driver  to  estimate  visibility. 

Manual  visibility  estimates  are  typically  made  by 
attempting  to  detect  high  contrast  targets  at  various 
known  distances.  The  farthest  distance  at  which  a 
target  can  be  reliably  detected  is  considered  the 
visibility  distance.  Ideally,  an  automated  visibility 
estimation  system  should  work  the  same  way. 
Unfortunately,  it  is  very  difficult  to  consistently 
find  high  contrast  targets  at  various  known  ranges 
from  a  moving  vehicle.  Even  the  features  that  are 
supposed  to  be  consistent  on  a  roadway,  the  lane 
markings,  vary  greatly  in  their  appearance,  and  are 
in  fact  frequently  missing  or  obscured.  The  Rap¬ 
idly  Adapting  Lateral  Position  Handler  (RALPH) 
system  [Pomerleau  and  Jochem,  1996]  overcomes 
this  difficulty  when  detecting  the  position  and  cur¬ 
vature  of  the  road  ahead  in  camera  images  by  uti¬ 
lizing  whatever  features  are  visible  on  the  roadway. 
These  features  may  include  lane  markings,  road/ 
shoulder  boundaries,  tracks  left  by  other  vehicles, 
and  even  subtle  pavement  discolorations  like  the 
oil  stripe  down  the  lane  center  when  necessary.  Our 
visibility estimation system exploits RALPH’s
ability  to  find  and  track  arbitrary  road  features.  In 
short,  the  system  estimates  visibility  by  measuring 
the  attenuation  of  contrast  between  consistent  road 
features  at  various  distances  ahead  of  the  vehicle. 
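
One way to turn contrast measurements at several distances into a visibility figure is sketched below. It assumes that contrast decays exponentially with distance and defines visibility as the distance at which contrast falls to 5% of its near-field value; both the model and the threshold are assumptions of this example, not a description of the algorithm built on RALPH.

```python
import math


def fit_attenuation(distances, contrasts):
    """Least-squares fit of ln(contrast) = ln(c0) - k * distance.

    Returns (c0, k): the extrapolated contrast at distance 0 and the
    attenuation rate per unit distance.
    """
    n = len(distances)
    logs = [math.log(c) for c in contrasts]
    mean_d = sum(distances) / n
    mean_l = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (l - mean_l) for d, l in zip(distances, logs))
    slope = sxy / sxx
    return math.exp(mean_l - slope * mean_d), -slope


def visibility_distance(k, fraction=0.05):
    """Distance at which contrast drops to `fraction` of c0 (requires k > 0)."""
    return math.log(1.0 / fraction) / k


# Synthetic measurements: contrast of a road feature at four distances (m).
distances = [10.0, 20.0, 30.0, 40.0]
contrasts = [0.8 * math.exp(-0.02 * d) for d in distances]

c0, k = fit_attenuation(distances, contrasts)
print(c0, k, visibility_distance(k))
```

A larger fitted attenuation rate, as would be expected in fog or heavy rain, gives a proportionally shorter visibility distance.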

The  visibility  estimation  algorithm  performs  well 
under  a  wide  variety  of  conditions.  The  rank  order¬ 
ing  of  six  conditions  tested  corresponds  reasonably 
well to one’s intuitive notion of how difficult it is to
see in these situations. Live vehicle tests in fog still
need  to  be  conducted  (fog  is  rare  in  Pennsylvania, 
particularly  during  the  winter  when  these  experi¬ 
ments  were  conducted).  However,  the  results  from 
the  simulated  fog  experiments  and  the  live  daytime 
tests  in  rainy  conditions  suggest  that  the  algorithm 
should  perform  well,  and  report  significantly 
reduced  visibility  under  foggy  conditions. 

While  all  the  work  reported  here  has  been  done 
with  a  standard  black  and  white  CCD  camera,  we 
are  investigating  the  potential  for  using  alternative 
sensors  for  improved  performance.  For  example,  a 
high-dynamic  range  camera,  such  as  a  VLSI  sort¬ 
ing  computational  sensor,  would  respond  more  like 
the  human  eye  in  extreme  lighting  conditions,  and 
could  therefore  provide  better  visibility  estimates. 
Another  possibility  would  be  to  combine  this  visi¬ 
bility  estimation  technique  with  smart  AOTFs  for 
multispectral  imaging.  By  testing  the  visibility  at 
different  wavelengths,  it  may  be  possible  to  select 
the  best  wavelength(s)  for  operation  under  the  cur¬ 
rent  conditions. 

5.  Multi-Agent  Learning  for  Signal 
Classification  in  Vision 

Contributors:  Astro  Teller  and  Manuela  Veloso 

A  wide  variety  of  machine  learning  mechanisms 
create  multiple  models  that  must  be  reconciled, 
chosen  among,  or  in  some  cases,  orchestrated.  In 
its  most  general  form,  this  orchestration  problem 
can  be  seen  as  part  of  the  multi-agent  learning 
problem. 

There  are  many  cases  in  which  a  task  to  be 
approached  with  machine  learning  techniques  can 
be, or must be, solved in more than one “piece.”
Learning  a  team  of  robotic  soccer  players  is  a  good 
example  of  a  task  that  could  conceivably  be  done 
as  a  single  agent,  but  lends  itself  very  naturally 
toward  learning  sub-solutions  and  then  (or  in  addi¬ 
tion)  learning  to  ensure  the  mutual  suitability  of 
these  sub-solutions.  This  insurance  of  mutual  suit¬ 
ability  is  the  orchestration  problem. 

Evolutionary  computation  is  a  natural  machine 
learning  environment  in  which  to  find  many, 
behaviorally distinct models. We focus on PADO, an
evolutionary computation framework designed
specifically for signal classification (e.g., [Teller
and  Veloso,  1997b]).  As  a  process  of  divide  and 
conquer,  PADO  evolves  multiple  pools  of  sub¬ 
solutions  and  then  orchestrates  one  or  more  learned 
models  from  each  pool. 

The  question  we  investigate  is:  “What  opportuni¬ 
ties  are  there  for  learning  in  the  orchestration  pro¬ 
cess  and  how  much  improvement  can  this  learning 
provide?”  While  answering  this  question,  our 
research  demonstrated  several  things  [Teller  and 
Veloso,  1997b].  First,  specific  experiments  on  dis¬ 
tinct  signals  demonstrated  the  feasibility  of 
PADO’s  divide  and  conquer  strategy;  the  failure  of 
the  evolved  orchestration  procedure  suggested 
PADO’s  preferability  to  unconstrained  learning. 
Second,  the  experiments  provided  a  specific  justifi¬ 
cation  for  maintaining  a  population;  orchestration 
puts  the  options  a  population  provides  to  good  use. 
And  finally,  this  work  introduced  specific  tech¬ 
niques  for  orchestration  learning  and,  through  their 
successful  application,  demonstrated  that  orches¬ 
tration  is  an  important  issue  and  that  learned 
orchestration  can  provide  dramatic  generalization 
improvements. 
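
As a rough illustration of what learning an orchestration can mean, the sketch below weights each model in a pool by its accuracy on held-out data and then combines the models' votes. This is not PADO's actual orchestration mechanism, which is described in [Teller and Veloso, 1997b]; the function names and the accuracy-based weighting scheme are assumptions made only for illustration.

```python
from collections import defaultdict


def learn_weights(models, samples, labels):
    """Weight each model by its accuracy on held-out data (an assumed scheme)."""
    weights = []
    for model in models:
        correct = sum(1 for x, y in zip(samples, labels) if model(x) == y)
        weights.append(correct / len(samples))
    return weights


def orchestrate(models, weights, x):
    """Return the class that receives the largest total weighted vote."""
    votes = defaultdict(float)
    for model, w in zip(models, weights):
        votes[model(x)] += w
    return max(votes, key=votes.get)
```

With uniform weights this reduces to a plain majority vote, so the learned weights are the only place where orchestration learning enters this simplified picture.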

6.  Adaptive  Acquisition  of  Search  Control 
Knowledge  in  the  Evolution  of  Face 
Recognition  Neural  Networks 

Contributors:  Matthew  Glickman  and  Katia  Sycara 

Search  algorithms  for  signal-to-symbol  matching 
patterned  after  biological  evolution  are  attractive 
for  use  in  domains  such  as  vision  that  have  com¬ 
plex  search  spaces  for  a  number  of  reasons.  These 
include: (1) their application does not explicitly
require deep insight into the domain; (2) they are
relatively straightforward to parallelize; and (3) their
natural  analog  has  resulted  in  entities  of  extraordi¬ 
nary  complexity  and  robustness.  However,  the 
search  performance  in  any  particular  domain  is 
highly  dependent  on  the  interaction  between  the 
chosen  representation  of  the  space  and  the  specific 
search  operators  employed.  For  evolutionary  algo¬ 
rithms  in  particular,  this  interaction  is  a  poorly 
understood  process,  leaving  practitioners  with  few 
guidelines  as  to  how  to  make  the  right  choices  to 
yield  good  performance. 

One popular approach to improving the performance
of search in a particular domain is to seek to
incorporate  pre-existing  knowledge  of  the  domain 
into  the  operators  and  representation.  However, 
this  approach  is  problematic  for  evolutionary 
search  because  of  the  aforementioned  opacity  of 
the interaction between the operators and the representation.
This difficulty, popularly known as “the
representation problem,” is only compounded in
more complex domains, presenting a formidable
obstacle to the application of artificial evolution in
precisely those domains in which it may be of
the greatest utility.

Therefore,  rather  than  seeking  to  find  how  pre¬ 
existing  domain  knowledge  can  be  best  exploited 
by  evolution,  our  research  is  directed  toward  the 
automatic  acquisition  of  such  knowledge  in  opera¬ 
tional  form.  The  experiments  reported  herein  dem¬ 
onstrate  that  information  about  a  particular  domain 
generated  over  the  course  of  evolutionary  search 
can  be  extracted,  analyzed,  and  then  employed  to 
improve  search  in  future  runs. 

The  space  explored  is  the  weight  space  of  fixed- 
topology,  feed-forward  artificial  neural  networks 
(ANNs)  for  face  recognition.  Over  the  course  of 
adaptation,  weight  vectors,  along  with  their  self- 
adapted,  variable  mutation  rate,  were  collected. 
These  data  were  then  used  to  train  another  ANN  to 
predict  the  appropriate  mutation  rate  for  a  given 
weight  vector  for  the  face-recognition  domain  in 
general. Finally, the mutation-rate prediction networks
were used to drive evolution on another face
recognition  task,  resulting  in  networks  with 
improved  generalization  performance. 
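
The data-collection step described above can be sketched as follows. Each individual carries a weight vector and its own mutation rate; the rate is perturbed before it is used to mutate the weights, and the pairs that survive selection are logged as candidate training data for a rate-prediction network. The log-normal rate update, the constant `tau`, the truncation selection, and the initial rate of 0.5 are all assumptions chosen for illustration, not the settings used in the reported experiments, and training the prediction network is not shown.

```python
import math
import random


def mutate(weights, rate, rng, tau=0.2):
    """Self-adaptive mutation: perturb the rate first, then the weights."""
    new_rate = rate * math.exp(tau * rng.gauss(0.0, 1.0))
    new_weights = [w + new_rate * rng.gauss(0.0, 1.0) for w in weights]
    return new_weights, new_rate


def evolve(fitness, dim, generations, pop_size, rng):
    """Run truncation selection and log the (weights, rate) pairs that survive."""
    population = [([rng.uniform(-1, 1) for _ in range(dim)], 0.5)
                  for _ in range(pop_size)]
    log = []
    for _ in range(generations):
        children = [mutate(w, r, rng) for w, r in population]
        ranked = sorted(population + children,
                        key=lambda ind: fitness(ind[0]), reverse=True)
        population = ranked[:pop_size]
        log.extend(population)  # survivors only: selection supplies the signal
    return population, log
```

Because only survivors are logged, the selection process itself decides which weight-vector/mutation-rate pairs become training examples, which is the property the text relies on.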

Our  preliminary  results  indicate  that  this  approach 
is reliably feasible. Because (1) the specific
weight-vector/mutation-rate pairs chosen for
training were selected via a simple Darwinian
selection process, and (2) the target mutation
rates  contained  in  these  data  had  also  been  adapted 
via  this  same  selection  process,  the  results  reported 
here  indicate  that  simple  Darwinian  selection  is 
sufficient  to  generate  a  training  signal  from  which 
domain/search-control  knowledge  may  be 
extracted.  This  result  indicates  a  promising  direc¬ 
tion  for  the  successful  application  of  artificial  evo¬ 
lution  in  complex  domains  such  as  image 
understanding. 

7. Visual Learning for Landmark
Recognition 

Contributors:  Martial  Hebert,  Katsushi  Ikeuchi, 
Yutaka Takeuchi, Patrick Gros

Recognizing  landmarks  is  a  critical  task  for  inter¬ 
action  of  a  machine  with  the  environment.  Land¬ 
marks  are  used  for  building  maps  of  unknown 
environments.  In  this  context,  the  traditional  recog¬ 
nition  techniques  based  on  strong  geometric  mod¬ 
els  cannot  be  used.  Rather,  models  of  landmarks 
must  be  built  from  observations  obtained  using 
image-based  techniques.  This  section  describes 
building  image-based  landmark  descriptions  from 
sequences  of  images,  and  then  recognizing  the 
landmarks.  This  approach  also  addresses  the  more 
general  problem  of  identifying  groups  of  images 
with  common  attributes  in  sequences  of  images. 
We  show  that,  with  the  appropriate  domain  con¬ 
straints  and  image  descriptions,  this  can  be  done 
using  efficient  algorithms. 

Recognizing  landmarks  in  sequences  of  images  is  a 
challenging  problem  for  a  number  of  reasons.  The 
appearance of any given landmark varies substantially
from one observation to the next. In addition
to variation due to different aspects, illumination
change, external clutter, and changing geometry of
the imaging devices also affect the
variability of the observed landmarks. Finally, it is
typically  difficult  to  use  accurate  3D  information  in 
landmark  recognition  applications.  For  those  rea¬ 
sons,  it  is  not  possible  to  use  many  of  the  object 
recognition  techniques  based  on  strong  geometric 
models. 

The  alternative  is  to  use  image-based  techniques  in 
which  landmarks  are  represented  by  collecting 
images  which  are  supposed  to  capture  the  “typical” 
appearance  of  the  object.  The  information  most  rel¬ 
evant  to  recognition  is  extracted  from  the  collection 
of  raw  images  and  used  as  the  model  for  recogni¬ 
tion.  This  process  is  often  referred  to  as  “visual 
learning.” 

Progress  has  been  made  recently  in  developing 
such approaches. For example, in object
modeling [Gross et al.], 2D or 3D models of objects
are built for recognition applications. An object
model is built by extracting features from a collection
of observations. The most significant features
are  extracted  for  the  entire  set  and  are  used  in  the 
model  representation.  Extensions  to  generic  object 
recognition  were  presented  recently  [Carlsson, 
1996]. 

Other  recent  approaches  use  the  images  directly  to 
extract  a  small  set  of  characteristic  object  images 
which are compared with observed views at recognition
time. For example, eigen-image techniques
are based on this idea.

Those  approaches  are  typically  used  for  building 
models  of  a  single  object  observed  in  isolation.  In 
the  case  of  landmark  recognition  for  navigation, 
there  is  no  practical  way  to  isolate  the  object  in 
order  to  build  models.  Worse,  it  is  often  not  known 
in  advance  which  of  the  objects  observed  in  the 
environment  would  constitute  good  landmarks. 
Visual  learning  must  therefore  be  able  to  identify 
groups  of  images  corresponding  to  “interesting” 
landmarks  and  to  construct  models  amenable  to 
recognition  out  of  raw  sequences  of  images. 

A  similar  problem,  although  in  a  completely  differ¬ 
ent  context,  is  encountered  in  image  indexing, 
where  the  main  problem  is  to  store  and  organize 
images  to  facilitate  their  retrieval  [Lamiroy  and 
Gros,  1996]  [Schmid  and  Mohr,  1996].  The 
emphasis  in  this  case  is  on  the  kind  of  features  used 
and  the  types  of  requests  that  can  be  made  by  the 
user.  For  image  retrieval,  actual  systems  (QBIC, 
JACOB,  Virage...)  are  closer  to  smart  browsing 
than  to  image  recognition.  Using  criteria  such  as 
color,  shape,  regions,  etc.,  the  systems  search  for 
images  most  similar  to  a  given  image.  The  user  can 
then  interact  with  the  system  to  define  which  of 
these  images  seems  the  most  interesting,  and  a  new 
set  of  closer  images  is  displayed. 

Our  system  tries  to  combine  those  two  categories 
of  systems.  In  a  training  stage,  the  system  is  given 
a  set  of  images  in  sequence.  The  aim  of  the  training 
is  to  organize  these  images  into  groups  based  on 
similarity  of  feature  distributions  between  images. 
The  size  of  the  groups  obtained  may  be  defined  by 
the  user,  or  by  the  system  itself.  In  the  latter  case, 
the  system  tries  to  find  the  most  relevant  groups, 
taking  the  global  distribution  of  the  images  into 
account.  In  a  second  step,  the  system  is  given  new 
images,  which  it  tries  to  classify  as  either  one  of 
the  learned  groups,  or  belonging  to  the  category  of 
unrecognized images. Figure 3 shows landmarks
identified from a moving vehicle.


Figure 3: Overhead view of the path followed
while collecting the images (distances are
indicated in meters). Four landmarks are
correctly identified, corresponding to groups 2,
5, 6, and 7 of the training sequence. Example
images from the test sequence are shown for

The  basic  representation  is  based  on  distributions 
of  different  feature  characteristics.  All  these  differ¬ 
ent  kinds  of  histograms  are  computed  for  the  whole 
image  and  for  a  set  of  sub-images.  Tests  similar  to 
Chi-square  tests  are  used  to  compare  these  histo¬ 
grams  and  define  a  distance  between  images.  This 
distance  is  then  used  to  cluster  the  images  in  what 
are  called  groups.  An  agglomerative  grouping 
algorithm  is  used  at  this  stage.  At  each  step  of  the 
algorithm,  the  clusters  made  are  evaluated  by  an 
entropy-like  function,  whose  maximum  gives  the 
optimal  solution  in  a  sense  specified  later.  Each 
group  is  then  characterized  by  a  set  of  feature  his¬ 
tograms.  When  new  images  are  given  to  the  sys¬ 
tem,  it  evaluates  a  distance  between  these  images 
and the groups. The system determines to which
group each new image is the closest, and a set of thresholds
is used to decide if the image belongs to this
group. 
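
The recognition step just described can be sketched in a few lines. The sketch assumes a single reference histogram per group and a single global threshold, whereas the text describes a set of feature histograms per group and a set of thresholds; the symmetric chi-square form and the names `chi_square_distance` and `classify` are likewise illustrative assumptions rather than the system's actual definitions.

```python
def chi_square_distance(h1, h2):
    """Symmetric chi-square-like distance between two equal-length histograms."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return total


def classify(histogram, groups, threshold):
    """Return the closest group's name, or None if it is farther than threshold."""
    best_name, best_dist = None, float("inf")
    for name, reference in groups.items():
        d = chi_square_distance(histogram, reference)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None
```

Returning `None` corresponds to assigning the image to the category of unrecognized images.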

The  main  goal  of  the  work  presented  here  was  to 
explore  the  use  of  tools  and  methods  in  the  field  of 
image  retrieval  when  applied  to  the  problem  of 
landmark  recognition.  It  is  clear  that  the  global 
architecture  of  the  system  is  close  to  that  of  object 
recognition  systems  [Gross  et  al.];  a  training  stage 
in  which  3D  shape,  2D  aspects,  or  groups,  are  char¬ 
acterized  is  followed  by  a  recognition  stage  in 
which  this  information  is  used  to  recognize  the 
models,  objects  or  groups  in  new  images.  The  dif¬ 
ference  comes  from  the  wide  diversity  of  the 
images  and  from  the  groups  which  are  not  reduced 
to  a  single  aspect  of  an  object.  The  two  challenging 
tasks  which  we  concentrate  on  describing  in  the 
remainder  of  the  paper  are  to  define  these  groups 
more  precisely  as  sets  of  images,  and  to  automati¬ 
cally  learn  a  characterization  for  each  group:  what 
remains  invariant,  what  varies,  and  in  which  pro¬ 
portions. 

8.  Conclusion 

CMU MURI performs cross-disciplinary research
which  will  result  in  high  performance  vision  sys¬ 
tems  adequate  for  “natural”  human  sensory  aug¬ 
mentation  and  sensor  driven  information  delivery. 
We  are  demonstrating  progress  in  all  levels  of 
vision:  from  image  formation  and  computational 
sensing to high level adaptive context-independent
learning strategies. We believe that the tight integration
of these techniques will provide opportunity
for more efficient bottom-up and top-down
control in vision processes which will result in
low-power, low-latency, compact, reliable and
adaptive vision systems (see Figure 4) crucial for
effective human sensory augmentation.

Figure 4: A vision system with tight integration of
image formation, sensing and processing for
adaptive low-latency applications.

Abstract

Our  Center  will  continue  to  develop  general- 
purpose  autonomous  systems  for  vision,  object 
recognition,  and  control  applications.  The  sys¬ 
tems  are  realized  in  software,  off-the-shelf  hard¬ 
ware,  and  customized  chips.  These  systems  are 
designed  to  operate  within  noisy  environments 
for  which  rules  are  not  known  and  which  can 
change  unexpectedly  through  time.  They  typ¬ 
ically  begin  with  models  of  a  key  brain  com¬ 
petence  and  end  with  fielded  applications  that 
have  been  thoroughly  benchmarked.  The  de¬ 
sign  of  adaptive  algorithms  will  be  emphasized, 
as  will  transfer  of  well-characterized  algorithms 
to  a  larger  class  of  applications  and  to  real¬ 
time  platforms.  New  projects  will  continue  to 
include  psychophysical  studies  of  how  humans 
search  complex  scenes;  models  of  coherent  pro¬ 
cessing  of  noisy  and  incomplete  image  data  from 
natural  and  artificial  sensors;  development  of 
self-organizing  classifiers  capable  of  fast,  stable, 
distributed,  incremental  learning  and  hypoth¬ 
esis  testing  in  response  to  nonstationary,  in¬ 
complete,  and  probabilistic  data;  algorithm  and 
hardware  development  for  head-mounted  space- 
variant  active  vision  systems;  development  of 
self-calibrating  autonomous  robots;  and  fabri¬ 
cation  of  chips  for  vision  and  classification  ap¬ 
plications. 


*  Supported  in  part  by  the  Defense  Advanced  Re¬ 
search  Projects  Agency  and  the  Office  of  Naval  Research 
(ONR  N00014-95-1-0409). 


1  Introduction 

This  report  summarizes  new  research  projects 
to  be  conducted  under  the  Multidisciplinary 
University  Research  Initiative  (MURI)  program 
by  the  Boston  University  Department  of  Cogni¬ 
tive  and  Neural  Systems,  the  Boston  Univer¬ 
sity  College  of  Engineering,  and  the  Johns  Hop¬ 
kins  University  Department  of  Electrical  and 
Computer Engineering. Our companion Technical
Report [Grossberg et al., 1997] reviews
our  research  approach  and  some  of  our  current 
efforts  to  develop  general-purpose  autonomous 
neural  systems  for  vision,  object  recognition 
and  control  applications.  The  present  PI  re¬ 
port  briefly  lists  the  objectives,  research  ques¬ 
tions,  and  evaluation  procedures  that  will  be 
used  in  our  continuing  work  on  these  and  re¬ 
lated  projects.  Background  information  should 
be  sought  in  the  Technical  Report. 

2  Boundary  and  Surface  Processing 
of  Natural  and  Synthetic  Images 
(Investigators:  A.  Baloch,  S. 
Grossberg,  R.  Raizada,  J. 
Williamson) 

Objectives: To continue development of
boundary segmentation and surface representation
algorithms to enhance image data from
synthetic aperture radar (SAR), multispectral
infrared (IR), laser detection and ranging
(LADAR), and related sensors for use by expert
photointerpreters, by non-expert users in
battlefield conditions, and as a preprocessor for
such image data before it is automatically classified
by adaptive pattern recognition algorithms.
These  results  may  thus  be  used  for  a  wide  va¬ 
riety  of  image  exploitation,  visual  surveillance, 
and  geospatial  modeling  applications. 

Research  Questions:  Two  types  of  research 
issues  will  be  emphasized.  The  first  concerns 
how  these  boundary  segmentation  and  surface 
representation  circuits  can  self-organize  their 
own  optimal  operating  parameters.  Such  a  de¬ 
velopment  would  be  important  from  at  least  two 
perspectives:  It  would  greatly  reduce  the  time 
needed  to  design  such  a  circuit  by  allowing  the 
circuit  itself  to  discover  its  best  operating  pa¬ 
rameters.  It  would  also  enable  the  circuit  to 
autonomously  recruit  processing  resources  in  re¬ 
sponse  to  changing  environmental  statistics  to 
better  discriminate  data  features  that  may  have 
not  been  anticipated  in  its  original  design.  The 
second  research  issue  concerns  how  to  realize 
this latest generation of algorithms in commercial
off-the-shelf digital signal processing boards
that  can  run  in  real-time,  and  to  work  with 
our  VLSI  (very  large  scale  integrated  circuits) 
teams  at  Johns  Hopkins  and  Boston  University 
to  begin  implementing  them  in  compact,  light¬ 
weight,  low-power,  real-time  chips. 

As  noted  in  Grossberg  et  al.  [1997],  the  circuits 
in  question  have  been  derived  from  an  analy¬ 
sis  of  how  the  visual  cortex  carries  out  simi¬ 
lar  tasks.  It  is  well-known  from  neurobiological 
experiments  that  these  cortical  circuits  may  be 
tuned  by  visual  experience.  This  is  true  both  for 
the  adaptive  filters  that  regulate  the  bottom- 
up  flow  of  signal  processing,  and  for  the  hori¬ 
zontal  connections  that  subserve  the  boundary 
segmentation  process.  We  will  investigate  how 
both  types  of  circuits  may  learn  their  own  best 
operating  parameters. 

To  implement  these  algorithms  in  off-the-shelf 
digital signal processing boards, we will investigate,
among others, the Texas Instruments
TMS320C80,  which  has  been  designed  for  imag¬ 
ing  and  graphics  applications.  The  C80  com¬ 
bines  4  digital  signal  processors,  a  RISC  pro¬ 
cessor,  and  an  I/O  controller  on  a  single  chip 
optimized  for  bandwidth-intensive  applications 
requiring massive parallel processing. The C80
will be adapted for real-time implementation of
neural  systems  for  processing  both  static  scenes 
and  attentive  motion  grouping  applications,  as 
noted  in  Section  6. 

Evaluation:  The  self-organization  project  will 
first  be  evaluated  by  using  the  new  adaptive  cor¬ 
tical  circuit  to  explain  and  simulate  key  neu¬ 
rophysiological  data  about  the  timed  formation 
of  identified  cortical  connections.  When  the  bio¬ 
logical  version  of  the  model  is  finished,  it  will  be 
used,  as  in  the  case  of  the  non-adaptive  model, 
to  enhance  SAR  data.  First,  the  adaptive  model 
will  be  trained  on  SAR  data  to  study  how  the 
statistical  properties  of  these  data  determine 
the  best  filter  and  grouping  parameters  through 
learning.  This  self-organized  circuit  will  then  be 
benchmarked  against  the  hand-crafted  circuits 
that  have  previously  been  used.  Finally,  in  later 
years,  the  circuit  will  be  trained  and  tested  on 
data  from  other  sensors  to  understand  how  the 
self-organized  parameters  for  each  sensor  type 
may  differ,  and  to  develop  a  strategy  for  imple¬ 
menting  a  general-purpose  boundary  and  sur¬ 
face  processing  system  with  these  results  as  a 
guide. 

The  circuit  board  implementation  project  will 
acquire  a  COTS  C80/PCI  powered  board  with 
associated  hardware  and  software,  and  immedi¬ 
ately  study  how  to  achieve  maximum  scalability 
of  such  boards  for  future  projects.  Software  li¬ 
braries  for  basic  circuit  modules  of  the  above 
architectures  will  then  be  developed.  These 
modules  will  be  integrated  into  architectures  for 
processing  textured  scenes.  The  results  derived 
with  the  real-time  hardware  will  be  compared 
with  those  derived  on  present  versions  of  the 
models. 

In  addition  to  these  research  projects,  we  will 
also  continue  to  transfer  this  technology  as  it 
develops  to  users  like  the  Machine  Intelligence 
Group  at  MIT  Lincoln  Laboratory,  for  testing 
on  their  classified  DoD  data. 


3  ARTEX  Classifiers  of  3-D  Objects 
and  Textured  Scenes 
(Investigators:  J.  Brown,  S. 
Grossberg,  J.  Williamson) 

Objectives:  There  are  three  main  objectives  of 
this  project:  (1)  To  develop  a  version  of  Gaus¬ 
sian  ARTMAP  that  uses  more  local  computa¬ 
tions,  and  thus  one  that  should  be  more  com¬ 
putationally  efficient  and  embeddable  within  a 
larger architecture for high-level IU. (2) To implement
the ARTEX architecture for classification
of textured regions in a scene, notably from
SAR  and  other  military  sensors,  in  real-time 
commercial  off-the-shelf  hardware  and  in  cus¬ 
tom  VLSI.  These  results  should  provide  real¬ 
time  recognition  of  both  natural  and  man-made 
objects  that  are  detected  by  a  wide  variety  of 
platforms.  (3)  To  embed  the  ARTEX  architec¬ 
ture  into  a  larger  architecture  for  recognition 
also  of  3-D  objects  and  more  complex  scenic 
configurations. 

Research  Questions:  Our  approach  to  the 
first  two  problems  will  follow  the  same  format 
as  in  Section  2. 

Evaluation:  The  projects  in  Section  2  will 
provide  an  improved  front-end  for  the  family 
of  Gaussian  ARTMAP  classifiers  that  will  be 
further  developed  in  this  project.  In  addition, 
the  new  Gaussian  ARTMAP  algorithms  will 
be  benchmarked  against  the  Expectation  Max¬ 
imization  (EM),  rule-based,  multilayer  percep- 
tron,  and  K-NN  algorithms  that  were  useful  in 
our  study  of  the  previous  version.  The  larger 
ARTEX  architecture  will  be  incorporated  into 
a  previously  benchmarked  VIEWNET  architec¬ 
ture  for  incrementally  learning  to  recognize  3-D 
objects  from  sequences  of  their  2-D  views  [Brad- 
ski  and  Grossberg,  1995].  This  version  of  the 
VIEWNET  architecture  will  be  fitted  with  a 
new What-and-Where invariance filter [Carpenter,
Grossberg, and Lesher, 1997] which will
enable  the  system  as  a  whole  to  incrementally 
learn  predictive  relationships  among  the  objects 
of  a  scene  and  their  spatial  locations. 


4  Model  of  3-D  Vision  and 
Figure-Ground  Separation 
(Investigators:  S.  Grossberg,  F. 
Kelly,  R.  Paine) 

Objectives:  To  further  develop  the  FACADE 
model  of  3-D  vision  and  to  apply  it  to  more 
realistic  scenes  in  which  overlapping  occluding 
and  partially  occluded  objects  occur.  As  noted 
in  Grossberg  et  al.  [1997],  such  preprocessing 
of  an  image  before  it  is  input  to  a  pattern  clas¬ 
sifier  can  lead  to  higher  classification  accuracy 
of  overlapping  objects  in  the  types  of  cluttered 
scenes  that  occur  in  the  battlefield. 

Research  Questions:  Research  will  focus 
upon how the multiple scales of the FACADE
model  work  together  to  generate  3-D  bound¬ 
ary  and  surface  representations  that  correctly 
parse  increasingly  complex  2-D  images  and  3- 
D  scenes  into  depthful  surface  representations. 
The  completion  of  both  boundary  and  surface 
representations  of  partially  occluded  objects  for 
purposes  of  pattern  recognition  will  also  be  em¬ 
phasized. 

Evaluation:  The  FACADE  model  will  be  fur¬ 
ther  tested  and  developed  by  simulating  key 
psychophysical  examples  of  how  human  ob¬ 
servers  do  3-D  figure-ground  separation.  A 
large  number  of  displays  will  be  simulated  to 
derive  multiple  constraints  on  the  circuit  de¬ 
sign.  These  simulations  will  include  pop-out 
of  occluding  figures  and  amodal  completion  of 
occluded  figures  in  response  to  line  drawings, 
to  surface  renderings  wherein  the  contrast  re¬ 
lationships  between  abutting  regions  are  given 
arbitrary  relative  values,  to  displays  in  which 
either  transparent  or  opaque  occlusion  percepts 
can obtain, to displays in which relative brightness
can in itself cause pop-out, and to ambiguous
stratification displays in which bistable reversals
of occluding and occluded surfaces occur.
After these psychophysical displays are
simulated,  we  will  begin  to  evaluate  the  model’s 
pop-out  and  amodal  completion  properties  on 
images  derived  from  military  sensors. 

5 ART and ARTMAP Neural
Networks  for  Applications: 
Self-Organizing  Learning, 
Recognition,  and  Prediction 

5.1 Geospatial Mapping
(Investigators: G.A. Carpenter,
J. Franklin, S. Gopal, S.
Macomber, S. Martens, C.
Woodcock)

Objectives:  A  remote  sensing  testbed  allows 
performance  comparisons  between  candidate 
neural  network  systems  and  state-of-the-art  im¬ 
age  processing  and  recognition  techniques,  in 
a  collaborative  project  with  researchers  at  the 
Boston  University  Center  for  Remote  Sensing. 

Research  Questions:  Phase  1  of  this  project 
developed  fuzzy  ARTMAP  networks  that  were 
specialized  to  identify  vegetation  classes  based 
on input that includes 6 spectral bands (visible
and IR) and terrain variables, such as aspect,
slope, and elevation. Researchers working
on the new NASA Earth Observing System
(EOS) are now using this work to help design
systems  for  image  analysis,  data  compression, 
feature  extraction,  and  temporal  prediction,  to 
be  placed  in  satellites  scheduled  for  launch  in 
1998  and  beyond.  Phase  2  of  the  remote  sensing 
project  began  with  the  planning  of  data  collec¬ 
tion  for  a  study  in  the  Plumas  National  Forest, 
in  northern  California,  resulting  in  a  rich  ground 
truth  data  set  that  is  now  the  basis  for  ongoing 
comparative  studies.  This  project  is  currently 
developing  automatic  methods  for  mixture  pre¬ 
diction  for  mapping  tasks  in  which  sites  typi¬ 
cally  feature  a  composite  of  output  classes,  such 
as  a  vegetation  mixture,  rather  than  a  single 
class.  Researchers  at  MIT  Lincoln  Laboratory 
are  already  considering  this  mixture  prediction 
method  for  remote  sensing  problems  that  range 
from  finding  tanks  in  infrared  imagery  to  ecosys¬ 
tem  mapping. 

Evaluation:  Phase  3  will  use  a  larger  database 
to  develop,  test,  and  field  the  results  of  the  neu¬ 
ral  network  prediction  algorithms.  New  network 
hierarchies  that  take  advantage  of  the  database 
size  and  structure  are  also  under  development. 
The goal of this work is to produce systems
that will rapidly provide accurate geospatial
maps  from  high-dimensional  satellite  and  ter¬ 
rain  data.  These  methods  could  be  applied  for 
military  intelligence  as  well  as  for  the  large-scale 
environmental  mapping  problems  of  the  unclas¬ 
sified  system  development  testbed. 

5.2 Computer-Assisted Medical
Diagnosis (Investigators: G.A.
Carpenter, R. D’Agostino, J.
Griffiths, N. Terrin)

Objectives:  Medical  database  analysis  has 
provided  a  fruitful  medium  in  which  to  develop 
and  test  new  learning  algorithms,  resulting  in 
computational  advances  such  as  those  of  the 
ARTMAP-IC  (instance  counting)  network. 

Research  Questions:  A  new  opportunity  for 
collaboration  with  medical  statisticians  and  re¬ 
searchers  at  the  New  England  Medical  Cen¬ 
ter  has  recently  arisen.  This  project,  which 
will  test  and  then  field  learning  algorithms  in 
medical  settings,  will  give  our  researchers  ac¬ 
cess  to  large-scale  databases  as  well  as  cutting- 
edge  biomedical  expertise.  Medical  databases 
present  many  of  the  same  challenges  as  might 
be  found  in  other  information  management  set¬ 
tings,  including  the  battlefield,  where  speed,  ef¬ 
ficiency,  ease  of  use,  and  accuracy  are  at  a  pre¬ 
mium.  In  addition,  a  direct  goal  of  improved 
computer-assisted  medicine  is  to  help  deliver 
quality  emergency  care  in  situations  that  may 
be  less  than  ideal,  including  those  that  military 
medical  personnel  might  encounter. 

Evaluation:  Systems  will  be  evaluated  in 
terms  of  effectiveness  in  dealing  with  noisy,  un¬ 
reliable,  and  inconsistent  data;  nonstationary 
and  region-specific  variations;  cases  that  are 
rare  but  significant;  large-scale  problems;  and 
on-line  learning  situations.  These  are  some 
of  the  issues  that  ARTMAP  database  analyses 
have  been  addressing  in  recent  years,  and  new 
projects  will  continue  to  bring  research  proto¬ 
types  closer  to  the  implementation  stage. 

5.3 Sonar Target Recognition
(Investigators:  L.  Burton,  G.A. 
Carpenter,  M.A.  Rubin,  W.W. 
Streilein) 

Objectives:  A  new  application  project  for 
sonar  target  recognition  is  now  being  developed. 
This  work  would  be  done  in  collaboration  with 
researchers  at  the  Orincon  Corporation  who  are 
currently  funded  by  a  Phase  II  SBIR  contract 
from  the  Office  of  Naval  Research. 

Research  Questions:  The  project  seeks  to 
improve  automated  sonar  recognition  capabil¬ 
ities.  Researchers  at  Orincon  have,  to  date, 
examined  the  recognition  capabilities  of  multi¬ 
layer  perceptrons  and  ellipsoidal  basis  functions, 
the  latter  giving  slightly  better  results.  Varia¬ 
tions  on  these  methods  perform  types  of  sensor 
fusion,  presenting  networks  with  both  sets  of 
processed  data  (SCAT  and  matched  filter)  or 
presenting  a  series  of  inputs  taken  from  a  series 
of  aspects.  The  sonar  recognition  problem  is 
similar  to  problems  on  which  fuzzy  ARTMAP 
and  ART-EMAP  (evidence  MAP)  have  already 
proved  successful.  The  data  set  would  also  be  a 
useful  testbed  for  enhanced  ARTMAP  recogni¬ 
tion  systems  such  as  those  being  developed  for 
radar  recognition  problems. 

Evaluation:  If  pilot  studies  prove  successful, 
large-scale  implementation  will  follow.  This 
type  of  collaborative  interaction  should  accel¬ 
erate  transfer  of  new  basic  research  results  into 
applications  technology  by  reducing  duplicated 
effort,  sharing  software,  and  providing  ongoing 
adaptation  of  basic  neural  network  architectures 
to  the  particular  demands  of  problems  that  are 
of  direct  interest  to  the  military. 

5.4  Radar  Target  Detection 
(Investigators:  G.A.  Carpenter, 
M.A.  Rubin,  W.W.  Streilein) 

Objectives:  Comparative  tests  of  buffering 
and  other  data  compression  methods  on  ex¬ 
tended  sets  of  simulated  radar  range  profiles  will 
continue  in  the  coming  year. 

Evaluation: Further development of
ARTMAP-FD (familiarity discrimination)
will focus on automated methods for selecting
the  familiarity  threshold.  The  aim  is  to  produce 
a  network  that  can  perform  on-line  familiarity 
discrimination.  Comparison  with  experiments 
on  human  familiarity  discrimination  may  also 
suggest  computational  improvements  to  the 
neural  network. 
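
As a rough illustration of what automated threshold selection could look like, the sketch below picks the threshold that maximizes accuracy on a small labeled validation set. It is not the ARTMAP-FD procedure: the function name `select_threshold`, the score values, and the accuracy criterion are all assumptions made for this example, and the familiarity scores are taken as given rather than computed by a network.

```python
def select_threshold(familiar_scores, unfamiliar_scores):
    """Pick the threshold that maximizes accuracy on labeled validation scores.

    A pattern is called familiar when its score is >= threshold.
    (Hypothetical helper; the scores are assumed to come from some
    familiarity measure that is not modeled here.)
    """
    candidates = sorted(set(familiar_scores) | set(unfamiliar_scores))
    total = len(familiar_scores) + len(unfamiliar_scores)
    best_t, best_acc = None, -1.0
    for t in candidates:
        hits = sum(s >= t for s in familiar_scores)         # familiar, accepted
        rejections = sum(s < t for s in unfamiliar_scores)  # unfamiliar, rejected
        acc = (hits + rejections) / total
        if acc > best_acc:  # ties keep the lowest threshold
            best_t, best_acc = t, acc
    return best_t, best_acc


# Made-up validation scores for illustration only.
familiar = [0.9, 0.8, 0.75, 0.6]
unfamiliar = [0.7, 0.5, 0.4, 0.3]
threshold, accuracy = select_threshold(familiar, unfamiliar)
```

An on-line variant would have to update the threshold as new patterns arrive; this batch version only shows the selection criterion.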

5.5 Fast Distributed Learning
(Investigators: G.A. Carpenter,
B. Noeske, W.W. Streilein)

Objectives:  Basic  research  to  develop  the  next 
generation  of  neural  network  pattern  recogni¬ 
tion  devices  will  continue  in  parallel  with  tech¬ 
nology  transfer  of  more  established  algorithms. 

Research  Questions:  One  such  project  will 
investigate  the  new  class  of  distributed  ART 
models.  Studies  to  investigate  computational 
properties  of  distributed  ART  (dART)  and  dis¬ 
tributed  ARTMAP  (dARTMAP)  systems  are 
leading  toward  systems  that  expand  applica¬ 
tion  capabilities  of  the  ART  family  of  networks. 
This  project  will  develop  a  set  of  benchmark 
simulation  examples  to  probe  learning  by  dis¬ 
tributed  ARTMAP  models.  These  examples 
will  serve  both  to  help  guide  design  choices  for 
networks  that  have  already  been  specified  and 
to  suggest  new  networks  with  additional  com¬ 
putational  features. 

Evaluation:  Initial  studies  will  examine  low¬ 
dimensional  problems  in  order  to  focus  on  de¬ 
tails  of  system  analysis  and  design.  Later  stud¬ 
ies  will  include  large-scale  problems  such  as 
geospatial  mapping  and  other  applications. 

6  Coherent  Processing  of  Moving 
Targets  (Investigators:  A.  Baloch, 
S.  Grossberg,  J.  Richardson) 

Objectives:  To  develop  the  Motion  Boundary 
Contour  System  (mBCS)  model  for  testing  on 
complex  imagery,  and  to  begin  to  implement  it 
on  real-time  hardware.  As  noted  in  our  Techni¬ 
cal  Report,  the  mBCS  model  realizes  a  process 
of  motion  capture  whereby  objects  that  might 
not otherwise be detected in cluttered and scintillating
scenes can pop out coherently with a
well-defined motion direction and speed. The
model  can  also  be  attentively  primed  to  selec¬ 
tively  track  objects  which  are  moving  in  a  pre¬ 
scribed  direction.  These  properties  are  valuable 
for  detecting  and  tracking  objects  under  diffi¬ 
cult  sensing  conditions. 

Research Questions: A central research focus
is to further develop our model of how form and
motion information are fused together to detect
and  track  moving  objects.  As  noted  in  our  Tech¬ 
nical  Report,  when  objects  are  detected  by  a 
small  number  of  pixels,  either  due  to  their  small 
size  on  the  sensor  or  to  interference  by  many 
noise  pixels,  an  emergent  boundary  segmenta¬ 
tion  may  be  needed  to  detect  them  with  high 
precision,  before  this  segmentation  is  injected 
into  the  motion  system  at  the  proper  processing 
stage.  We  will  further  develop  the  form-motion 
fusion  model  to  do  this  in  response  to  increas¬ 
ingly  complex  image  sequences.  Of  particular 
importance  is  the  question  of  how  the  motions 
of  many  densely  moving  objects  are  not  con¬ 
fused  with  one  another  for  purposes  of  target 
tracking. 

Evaluation:  The  form-motion  fusion  model 
will  be  developed  by  simulating  its  responses 
to  increasingly  complex  moving  visual  shapes, 
starting  with  filled-in  forms,  then  outline  forms, 
then  forms  defined  by  textures  moving  with  re¬ 
spect  to  textured  backgrounds  with  increasing 
levels  of  multiplicative  noise.  After  these  stud¬ 
ies  are  completed,  we  will  further  develop  and 
test  the  model  on  LADAR  and  multispectral  IR 
imagery.  Throughout  this  development  process, 
we  will  also  study  how  the  types  of  processes 
that  are  used  for  form-motion  fusion  may  be 
implemented  on  the  C80  board. 

7  Psychophysical  Experiments  for 
Real  World  Tasks 

7.1  Visual  Search  Experiments  in 
Cluttered  Environments 
(Investigators:  J.  Beck,  R. 
Cunningham,  E.  Mingolla) 

Objectives:  To  help  determine  factors  affect¬ 
ing  human  performance  in  visual  searches  for 
targets in clutter in grayscale imagery. Increasingly
realistic scenery will be sought as mechanisms
of visual search are better elucidated.

Research Question: One important factor is
the lighting of natural scenes. Does the claimed
human bias toward interpreting visual scenes
as illuminated from above hold for a visual
search task that employs imagery more complex
than is typical in the psychophysical
laboratory?

Evaluation: Human performance in finding
a target in clutter on simple, computer-generated
stimuli will be compared to that on
more naturalistic imagery.

7.2 Experiments on the Perceived
Segregation of
Element-Arrangement Textures
(Investigators: J. Beck, E.
Mingolla, S. Oddo)

Objectives:  To  help  determine  factors  affect¬ 
ing  human  performance  in  breaking  camouflage 
through  perceptual  segmentation  of  image  re¬ 
gions  from  chromatic  textural  information. 

Research  Question:  How  do  factors  such  as 
hue  similarity,  or  the  responses  of  retinal  cone 
receptors,  help  to  predict  how  readily  a  person 
can  see  textural  differences  in  colored  imagery? 

Evaluation: Human performance on
computer-generated stimuli has been measured
in order to determine the fundamental
mechanisms of human chromatic texture
segregation.

8 Adaptive Control of Eye
Movements (Investigators: G.
Arakawa, D. Bullock, G. Gancarz,
S. Grossberg)

Objectives:  To  integrate  the  multimodal  fu¬ 
sion  model  for  learning  to  compute  ballistic 
movement  decisions  in  response  to  competing 
visual,  auditory,  and  planned  movements  into 
a  larger  system  architecture  for  movement  con¬ 
trol.  This  system  will  be  capable  of  controlling 
both ballistic and continuous tracking movements.
The self-calibrating capability of this
tracking system will be of value in many applications
in which system parameters may change
due to use in the field. Psychophysical experiments
will also be carried out to test concepts
concerning how human observers continuously
track moving targets under intermittently occluding
conditions. These experiments will help
to evaluate human performance under such
conditions, as well as to design better automatic
tracking systems for dealing with them.

Research Questions: Three central questions
will be addressed: (1) How to combine ballistic
and continuous tracking capabilities into a single
self-calibrating system? In particular, how
is the decision made as to whether continuous or
ballistic tracking is released in a given context?
(2) How to transform the decisions made within
these systems into the controller which moves
the self-calibrating motor plant? (3) How to set
up the learning capabilities of the system so that
all the different competing sensory sources can
generate accurate movements, even though they
use multiple converging and diverging pathways
to move the tracking system?

Evaluation:  We  will  continue  to  develop  a 
model  of  how  the  brain  accomplishes  these  feats. 
In  particular,  the  ballistic  and  continuous  move¬ 
ments  to  be  modeled  first  are  the  saccadic  and 
smooth  pursuit  movements  of  the  eyes.  A  model 
of  the  interactions  of  the  superior  colliculus, 
frontal  eye  fields,  and  posterior  parietal  cortex 
for  multimodal  fusion  will  be  joined  to  a  model 
of  how  these  regions  interact  with  the  cerebel¬ 
lum,  reticular  formation,  and  eye  muscle  plant 
to  learn  the  correct  gains  whereby  to  generate 
accurate  eye  movements  in  response  to  all  com¬ 
binations  of  movement  commands.  The  model 
will  also  be  designed  to  explain  how  the  human 
smooth  pursuit  system  can  adaptively  maintain 
near-unity  tracking  gains  by  learning  both  to 
reconstruct  target  velocity  and  to  compensate 
for sensory-motor processing lags. Experiments
will be carried out to obtain a precise estimate
of the gain of smooth pursuit during an
interval when a moving target is temporarily
occluded. This experiment and others will be
carried out on our recently purchased eye-head
tracking system from ISCAN Inc., which will allow
a full range of experiments to test models of
eye-head cooperation in ballistic and continuous
tracking.

9 Head Mounted Space-Variant
Active Vision System: Algorithms
and Hardware (Investigators: G.
Bonmassar, B. Fischl, D. Fraser,
E. Schwartz)

Objectives: To build miniature, low-cost, low-power,
lightweight, “wearable” space-variant
active vision systems, and to provide algorithmic
and hardware solutions toward this goal,
implementing high-performance visual navigation,
surveillance, target acquisition, and pattern
recognition. Applications include computer-aided
vision for low light, night vision,
infrared, visual prosthesis, and navigation.

Research  Questions: 

• Construction of a wearable space-variant active
vision system.

• Development of high-performance, real-time
image processing algorithms for early
vision, navigation (e.g., by template matching
to a geographic image database), and
pattern recognition (e.g., face recognition
from a database of faces) for use in space-variant
vision applications.

• Demonstration of a man-machine interface
enabling blind subjects to
navigate and perform visual pattern recognition
via the use of a wearable space-variant
active vision system.

• Construction and demonstration of a hardware
and electronics fabrication infrastructure in the
Cognitive and Neural Systems Department
of Boston University.

Evaluation: Demonstration, in an unconstrained
“street environment”, of a real-time miniaturized
hardware system, with algorithmic implementation
achieving 30 frames/second performance on face
recognition, navigation, target acquisition, and
pattern recognition tasks.

10 Robotic Navigation under Visual
Guidance (Investigators: P.
Gaudiano, A. Harner, E. Sahin)

This  research  focuses  on  the  development  of  au¬ 
tonomous  mobile  robots.  Two  different  modules 
for  different  aspects  of  visually  guided  naviga¬ 
tion  will  be  developed. 

Objectives:  The  long-range  goal  is  to  develop 
an  integrated  system  for  control  and  navigation 
of  autonomous  mobile  robots  in  unknown,  un¬ 
structured  environments.  Specifically,  the  goal 
is  to  develop  mobile  robots  that  can  learn  about 
their  own  dynamics,  kinematics,  and  sensory 
apparatus  so  as  to  operate  without  supervi¬ 
sion  in  an  environment  that  is  unknown  or  that 
can  undergo  unexpected  changes.  The  develop¬ 
ment  of  such  a  system  would  have  direct  ap¬ 
plications  for  battlefield  scenarios,  where  au¬ 
tonomous  robots  could  operate  under  adverse 
conditions  on  tasks  such  as  intelligence  gath¬ 
ering,  mine  detection,  and  disabling  of  enemy 
equipment. 

Research Questions: Current state-of-the-art
robots are not yet capable of robust, adaptive
behavior in unknown or unstructured environments
such as a battlefield. Teleoperation is
only possible under certain circumstances, and
it still suffers from the inability of the teleoperated
vehicle to adapt to changes that might result,
for instance, from damaged components or
unreliable communications channels. An ideal
autonomous robot might be able to operate under
teleoperation when possible, but otherwise
continue to function even if teleoperation becomes
impractical.

The  development  of  robust,  adaptive  robots  is  a 
research  area  that  encompasses  many  problems, 
such  as  low-level  control,  visual  recognition,  lo¬ 
calization,  and  path  planning.  Algorithms  have 
been  developed  in  each  of  these  sub-areas,  but 
little  exists  in  terms  of  robust,  integrated  sys¬ 
tems  that  combine  the  results  from  most  of,  or 
all  of  these  research  areas. 

This  project  will  avail  itself  of  the  interdisci¬ 
plinary  expertise  that  exists  in  the  Department 
of Cognitive and Neural Systems. The focus
will be on unsupervised, autonomous modules
for low-level control, visual processing, sensor
fusion, and navigation. These will be modified
and enhanced as needed to function on real
robots.

Evaluation:  The  models  will  be  tested  on 
real  mobile  robots.  As  is  well  known  in  the 
robotics  community,  implementing  any  model 
on  a  real  robot  is  much  more  challenging  than 
doing  a  computer  simulation.  The  effects  of  re¬ 
alistic  sensor  noise,  unreliable  communications 
and  sudden  changes  in  the  environment  will  be 
evaluated. 

A  criterion  of  success  is  that  the  competence 
arises  in  an  unsupervised  and  self-organizing 
fashion,  so  that  it  can  adapt  to  circumstances 
that  were  not  foreseen  by  the  algorithm’s  de¬ 
signer.  To  demonstrate  robustness,  a  variety 
of  mobile  robot  platforms  are  being  used  (see 
our  Technical  Report),  and  success  will  be  con¬ 
firmed  if  the  algorithms  work  on  all  of  them. 

11 Neuromorphic VLSI for Battle
Awareness (Investigators: A.G.
Andreou, G. Cauwenberghs, A.
Hubbard)

Objectives:  Our  work  focuses  on  circuit  im¬ 
plementations  (VLSI)  of  boundary  and  surface 
completion  networks  (BCS/FCS)  that  enhance 
noisy  images.  We  will  develop  chips  and/or  elec¬ 
tronic  subsystems  that  can  eventually  be  put 
on  chips.  In  the  case  of  SAR,  we  seek  to  speed 
computations  that  would  ordinarily  be  carried 
out  using  computers.  In  such  a  case,  the  objec¬ 
tive  would  not  necessarily  be  to  conserve  power, 
but  to  increase  speed.  In  the  case  of  small-scale 
sensors  on  a  battlefield,  a  low-power  approach 
needs  to  be  taken.  Moreover,  for  cost  purposes, 
the  relative  effectiveness  of  partial  BCS/FCS 
and  ART  learning  systems  that  are  hand-held 
or  involve  telemetry  can  be  of  significant  utility. 

Research  Questions:  The  following  are  key 
scientific  research  and  engineering  issues  that 
will  be  addressed: 

1. Interchip communication: This will be a
major research thrust in the next year, as it
is a key technology component for integrating
larger systems. The neuromorphic VLSI chips
for the BCS model will be refined to overcome
present deficiencies, and we will aim toward fabricating
large designs (at least 128 x 128 cells)
per chip. Work will also continue on the address
event representation and the mapping of
the BCS/FCS architecture onto this emerging
communication/computation technology.

2. Synapse design: Synapse design for the ART
family of learning machines will be revisited, and we
will investigate the use of floating gate structures
and improved dynamic analog storage
techniques (ADRAM) for multiple-level storage.
Whether multiple-level analog storage
is the way of the future is hotly
debated in the FLASH ROM community.

3. Resistive grid computation: This remains an
open issue, which will be investigated with the aim
of compact, low-power, high-speed designs for local
computations.

4. Technology limitations: These will ultimately
determine the performance of future-generation
systems. We will seek to develop a quantitative
framework to predict how the technology
constrains possible implementations
of the sophisticated algorithms that are proposed
today. This will be done hierarchically
at the device, circuit, and architecture levels.

Evaluation: The work on VLSI architecture
and design will be evaluated using both
metrics intrinsic to the VLSI
design and architecture work and metrics that
assess the relevance of our work to the program
objectives.

The  prime  evaluation  criterion  is  a  performance 
comparison  between  the  actual  hardware  mod¬ 
ules  and  the  software  simulations.  In  collabo¬ 
ration  with  the  group  at  Boston  University  we 
will  run  a  battery  of  tests  using  the  SAR  data 
available  that  have  been  used  for  algorithm  de¬ 
velopment.  Other  relevant  criteria  are  listed  be¬ 
low: 

1. Computational throughput metrics:
Traditional measures of operations per second
(OPS) or channel capacity (spatial and temporal)
will be employed. In comparisons with
other systems, these will be normalized to actual
computing costs, which are power, size, and
weight.

2.  Design  process  metrics:  The  work  will 
be  evaluated  based  on  generated  portable  li¬ 
braries  for  characterized  circuit  designs  that  can 
be  readily  shared  with  government  contractors. 
Behavioral  models  for  circuits  as  well  as  simu¬ 
lators  for  architectural  optimization  will  also  be 
provided  along  with  the  actual  layouts.  Thus 
a  particular  manufacturer  could  “plug  in”  spe¬ 
cific  models  for  their  manufacturing  process,  re¬ 
simulate  the  chip  behavior,  amend  as  appropri¬ 
ate,  and  begin  production. 

3. Battlefield awareness relevance: The
success of this endeavor must ultimately be related
to a concrete technology transfer effort. We
will deploy prototyped systems in actual real-world
applications and problems of interest to
DOD. We are collaborating with colleagues at
the Johns Hopkins Applied Physics Laboratory
to deploy BCS/FCS VLSI designs that are
currently being developed under the MURI project
on the Flare Genesis autonomous balloon-borne
sun observatory:

hurlbut.jhuapl.edu/FlareGenesis/mission.html

Abstract

We describe a flexible model similar to those of
Vetter and Poggio (1993, 1995) and Jones and Poggio
(1995) for representing images of objects of a certain
class, known a priori, such as faces, and
introduce a new algorithm for matching it to a
novel image and thereby performing image analysis.
The flexible model is learned from example
images (called prototypes) of objects of a
class. In this paper we introduce an effective
stochastic gradient descent algorithm that automatically
matches a model to a novel image
by finding the parameters that minimize the error
between the image generated by the model
and the novel image. Our approach can provide
novel solutions to several vision tasks, including
the computation of image correspondence,
object verification, image synthesis, and image
compression.

1  Introduction 

An important problem in computer vision is
to model classes of objects in order to perform
image analysis by matching the models to
views of new objects of the same class, thereby
parametrizing the novel image in terms of a
known model. Many approaches have been proposed.
Several of them represent objects using
3D models, represented in different ways (for a
review, see [Besl and Jain, 1985]). Such models
are typically quite sophisticated, difficult to
build, and hard to use for many applications,
image matching in particular. A rather different
approach is suggested by recent results in
the science of biological vision.

(Footnote, partially recovered: ... Sumitomo Metal Industries, Kodak, Daimler-Benz and ... and Helen Whitaker Chair at MIT’s Whitaker College.)

There is now convincing psychophysical and
even physiological evidence suggesting that the
human visual system often uses strategies that
have the flavor of object representations based
on 2D rather than 3D models ([Edelman and
Bulthoff, 1990]; [Sinha, 1995]; [Bulthoff et al.,
1995]; [Pauls et al., 1996]). With this motivation,
we have explored an approach in which
object models are learned from several prototypical
2D images.

Though the idea of synthesizing a model for
a class of objects, such as faces or cars, is
quite attractive, it is far from clear how the
model should be represented and how it can
be matched to novel images. In other papers
we have introduced a flexible model of an object
class as a linear combination of example
images, represented appropriately. Image representation
is the key issue here because, in order
for a linear combination of images to make
sense, they must be represented in terms of elements
of a vector space. In particular, adding
two image representations together must yield
another element of the space. It turns out that
a possible solution is to set the example images
in pixelwise correspondence. In our approach,
the correspondences between a reference image
and the other example images are obtained in a
preprocessing phase. Once the correspondences
are computed, an image can be represented as
a shape vector and a texture vector. The shape
vector specifies how the 2D shape of the example
differs from the reference image. Analogously,
the texture vector specifies how the texture
differs from the reference texture (here we
are using the term “texture” to mean simply
the pixel intensities of the image). The flexible
model for an object class is thus a linear
combination of the example shapes and textures
which, for given values of the parameters, can
be rendered into an image. The flexible model
is thus a generative model which can synthesize
new images of objects of the same linear class.
But how can we use it to analyze novel images?
In this paper we provide an answer in terms of a
novel algorithm for matching the flexible model
to a novel image.

The paper is organized as follows. First, we
discuss related work. In section 3 we explain
our model in detail. Section 4 describes the matching
algorithm. Section 5 shows example models
of object classes and presents experiments
on matching novel images. Section 6 concludes
with a summary and discussion.

2  Related  Work 

The “linear class” idea of [Poggio and Vetter,
1992] and [Vetter and Poggio, 1995], together
with the image representation based on pixelwise
correspondence used by [Beymer et al.,
1993] (see also [Beymer and Poggio, 1996]), is the
main motivation for our work. Poggio and Vetter
introduced the idea of linear combinations
of views to define and model classes of objects.
They were inspired in turn by the results of [Ullman
and Basri, 1991] and [Shashua, 1992], who
showed that linear combinations of three views
of a single object may be used to obtain any
other views of the object (barring self-occlusion
and assuming orthographic projection). Poggio
and Vetter defined a linear object class as a set
of 2D views of objects which cluster in a small
linear subspace of $\mathbb{R}^{2n}$, where $n$ is the number
of feature points on each object. They showed
that in the case of linear object classes rigid
transformations can be learned exactly from a
small set of examples. Jones and Poggio [Jones
and Poggio, 1995] sketched a novel approach to
match linear models to novel images that can be
used for several visual analysis tasks, including
recognition. In this paper we develop the approach
in detail and show its performance not
only on line drawings but also on gray-level images.

Recently we have become aware of several papers
dealing with various forms of the idea of
linear combination of prototypical images. Choi
et al. [Choi et al., 1991] were perhaps
the first (see also [Poggio and Brunelli, 1992]) to
suggest a model which represented face images
with separate shape and texture components,
using a 3D model to provide correspondences
between example face images. In his study
of illumination invariant recognition techniques,
Hallinan [Hallinan, 1995] describes deformable
models of a similar general flavor. The work
of Taylor and coworkers ([Cootes and Taylor,
1992]; [Cootes and Taylor, 1994]; [Cootes et al.,
1992]; [Hill et al., 1992]) on active shape models
is probably the closest to ours. It is based on
the idea of linear combinations of prototypes to
model non-rigid transformations within classes
of objects. However, they use a very sparse
set of corresponding points in their model (we
use dense pixelwise correspondences), and they
handle texture differently from us. Our image
representation, relying on shape and texture
vectors obtained through pixelwise correspondence,
seems to be a significant extension
which allows us to incorporate texture as well
as shape seamlessly into a single framework.

3  Modeling  Classes  of  Objects 

The work of Ullman and Basri [Ullman and
Basri, 1991] and Poggio and Vetter [Poggio and
Vetter, 1992] was based on a representation of
images as vectors of the x, y positions of a small
number of labeled features, and left open the
question of how to find them reliably and accurately.
Starting with [Beymer et al., 1993],
we have attempted to avoid the computation of
sparse features while keeping as much as possible
the representation of Ullman and Basri
which has, at least under some conditions, the
algebraic structure of a vector space. For images
considered as bitmaps, on the other hand, basic
vector space operations like addition and linear
combination are not meaningful. We have argued
therefore that a better way to represent
images is to associate with each image a shape
vector and a texture vector (for a review, see
[Poggio and Beymer, 1996]).

The shape vector of an example image associates
to each pixel in the reference image the coordinates
of the corresponding point in the example
image. The texture vector contains for each
pixel in the reference image the color or gray
level value for the corresponding pixel in the example
image. We refer to the operations which
associate the shape and texture vectors to an
image as vectorizing the image. Instead of two
separate vectors we can also consider the full
texture-shape vector, which has dimensionality
N + 2N, where N is the number of pixels in the
image. The term shape vector refers to the 2D
shape (not 3D!) relative to the reference image.
Note that edge or contour-based approaches are
a special case of this framework. If the images
used are edge maps or line drawings, then the
shape vector may include only entries for points
along an edge, without the need for an explicit
texture vector.
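
The vectorizing step can be sketched in a few lines, assuming the pixelwise correspondence field is already available. The function name `vectorize`, the nested-list image layout, and the nearest-neighbour sampling are illustrative assumptions, not details given in the paper.

```python
def vectorize(example, correspondence):
    """Return (shape, texture) vectors for one example image.

    example        -- 2D list of gray levels, indexed example[y][x]
    correspondence -- 2D list over reference pixels; correspondence[y][x]
                      is the (x_hat, y_hat) point in the example image
                      (assumed to be precomputed)
    """
    shape, texture = [], []
    height, width = len(example), len(example[0])
    for row in correspondence:
        for x_hat, y_hat in row:
            # Shape: two coordinates per reference pixel (2N entries).
            shape.extend([x_hat, y_hat])
            # Texture: one gray level per reference pixel (N entries).
            # Nearest-neighbour sampling keeps the sketch short.
            xi = min(max(int(round(x_hat)), 0), width - 1)
            yi = min(max(int(round(y_hat)), 0), height - 1)
            texture.append(example[yi][xi])
    return shape, texture


# 2x2 reference grid; each reference pixel maps one pixel to the right.
example = [[10, 20, 30],
           [40, 50, 60]]
correspondence = [[(1, 0), (2, 0)],
                  [(1, 1), (2, 1)]]
shape, texture = vectorize(example, correspondence)
```

With N = 4 reference pixels, the shape vector has 2N = 8 entries and the texture vector N = 4, so the full texture-shape vector has N + 2N = 12 entries.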

The shape and texture vectors form separate
linear vector spaces with specific properties.
The shape vectors resulting from different orthographic
views of a single 3D object (in which
features are always visible) constitute a linear
vector subspace of very low dimensionality
spanned by just two views ([Ullman and
Basri, 1991]; see also [Poggio, 1990]). For a
fixed viewpoint a specific class of objects with
a similar 3D structure, such as faces, seems to
induce a texture vector space of relatively low
dimensionality, as shown indirectly by the results
of [Kirby and Sirovich, 1990] and more
directly by [Lanitis et al., 1995]. Using pixelwise
correspondence, [Vetter and Poggio, 1995]
and [Beymer and Poggio, 1995] showed that a
good approximation of a new face image can
be obtained with as few as 50 base faces, suggesting
a low dimensionality for both the shape
and the texture spaces. As reviewed by [Poggio
and Beymer, 1996], correspondence and the resulting
vector structure underlie many of the recent
view-based approaches to recognition and
detection, either implicitly or explicitly.

Certain special object classes (such as cuboids
and symmetric objects) can be proved to be
exactly linear classes (see [Poggio and Vetter,
1992]). Later in the paper we will show that
there are classes of objects the images of which,
for similar view angle and imaging parameters,
can be represented satisfactorily as a linear
combination of a relatively small number of
prototype images.

3.1  Formal  specification  of  the  model 

In this section we will formally specify our
model, which we refer to as a linear object class
model. An image $I$ is viewed as a mapping

$I : \mathcal{R}^2 \to \mathcal{I}$

such that $I(x, y)$ is the intensity value of point
$(x, y)$ in the image. $\mathcal{I} = [0, a]$ is the range of
possible gray level values. For eight bit images,
$a = 255$. Here we are only considering gray
level images, although color images could also
be handled in a straightforward manner. To
define a model, a set of example images called
prototypes is given. We denote these prototypes
as $I_0, I_1, \ldots, I_N$. Let $I_0$ be the reference
image. The pixelwise correspondences between
$I_0$ and each example image are denoted by a
mapping

$S_j : \mathcal{R}^2 \to \mathcal{R}^2$

which maps the points of $I_0$ onto $I_j$, i.e.
$S_j(x, y) = (\hat{x}, \hat{y})$ where $(\hat{x}, \hat{y})$ is the point in $I_j$
which corresponds to $(x, y)$ in $I_0$. We refer to $S_j$
as a correspondence field and interchangeably as
the shape vector for the vectorized $I_j$. We define

$T_j(x, y) = I_j \circ S_j(x, y) = I_j(S_j(x, y)).$  (1)

$T_j$ is the warping of image $I_j$ onto the reference
image $I_0$. In other words, $\{T_j\}$ is the set
of shape-free prototype images - shape free in
the sense that their shape is the same as the
shape of the reference image. The idea of the
model is to combine linearly the textures of the
prototypes all warped to the shape of the reference
image and therefore in correspondence
with each other. The resulting texture vector
can then be warped to any of the shapes defined
by the linear combination of prototypical
shapes.
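
As an illustration of equation 1, the warp of a prototype onto the reference shape can be sketched as a lookup through the correspondence field. This is a minimal sketch only, not the authors' implementation: the names `sample_bilinear` and `warp_to_reference`, the nested-list image layout and the choice of bilinear interpolation with border clamping are all assumptions made for the example.

```python
from typing import List, Tuple

Image = List[List[float]]                  # image[y][x], gray levels
Field = List[List[Tuple[float, float]]]    # field[y][x] = (x_hat, y_hat) in the prototype


def sample_bilinear(image: Image, x: float, y: float) -> float:
    """Bilinearly interpolate image at real-valued (x, y), clamping to the border."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom


def warp_to_reference(prototype: Image, field: Field) -> Image:
    """Compute T_j(x, y) = I_j(S_j(x, y)) for every reference pixel (x, y)."""
    return [[sample_bilinear(prototype, sx, sy) for (sx, sy) in row] for row in field]


# Toy example: the prototype is the reference shifted one pixel to the right,
# so the correspondence field is S(x, y) = (x + 1, y).
prototype = [[0.0, 10.0, 20.0, 30.0],
             [0.0, 10.0, 20.0, 30.0]]
field = [[(x + 1.0, float(y)) for x in range(3)] for y in range(2)]
shape_free = warp_to_reference(prototype, field)
```

Because the lookup runs over reference pixels, every warped prototype lands on the same pixel grid, which is what allows the textures to be added pixel by pixel.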

More formally, the flexible model is defined as
the set of images $I^{model}$, parameterized by
$b = [b_0, b_1, \ldots, b_N]$, $c = [c_0, c_1, \ldots, c_N]$, such that

$I^{model} \circ \left( \sum_{i=0}^{N} c_i S_i \right) = \sum_{j=0}^{N} b_j T_j.$  (2)

The summation $\sum_{i=0}^{N} c_i S_i$ describes the shape
of every model image as a linear combination
of the prototype shapes. Similarly, the summation
$\sum_{j=0}^{N} b_j T_j$ describes the texture of every
model image as a linear combination of the prototype
textures. Note that the coefficients for
the shape and texture parts of the model are
independent.

In order to allow the model to handle translations,
rotations, scaling and shearing, a global
affine transformation is also added. The equation
for the model images can now be written

$I^{model} \circ \left( A \circ \sum_{i=0}^{N} c_i S_i \right) = \sum_{j=0}^{N} b_j T_j$  (3)

where $A : \mathcal{R}^2 \to \mathcal{R}^2$ is an affine transformation

$A(x, y) = (p_0 x + p_1 y + p_2, \; p_3 x + p_4 y + p_5).$  (4)

The constraint $\sum_{i=0}^{N} c_i = 1$ is imposed to avoid
redundancy in the parameters since the affine
parameters allow for changes in scale.
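
The shape part of the model, that is the linear combination of correspondence fields followed by the affine map, can be sketched as below. This is an illustrative sketch under stated assumptions: the function names (`combine_fields`, `apply_affine`, `model_shape`) and the list-of-tuples field layout are hypothetical, and the check on the coefficient sum simply enforces the constraint stated in the text.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Field = List[List[Point]]          # field[y][x] = corresponding point (x_hat, y_hat)


def combine_fields(fields: Sequence[Field], c: Sequence[float]) -> Field:
    """Return sum_i c_i * S_i, after checking the constraint sum_i c_i = 1."""
    if abs(sum(c) - 1.0) > 1e-9:
        raise ValueError("shape coefficients must sum to 1")
    h, w = len(fields[0]), len(fields[0][0])
    return [[(sum(ci * f[y][x][0] for ci, f in zip(c, fields)),
              sum(ci * f[y][x][1] for ci, f in zip(c, fields)))
             for x in range(w)] for y in range(h)]


def apply_affine(p: Sequence[float], point: Point) -> Point:
    """A(x, y) = (p0*x + p1*y + p2, p3*x + p4*y + p5)."""
    x, y = point
    return (p[0] * x + p[1] * y + p[2], p[3] * x + p[4] * y + p[5])


def model_shape(fields: Sequence[Field], c: Sequence[float],
                p: Sequence[float]) -> Field:
    """The shape of a model image: A o (sum_i c_i S_i)."""
    return [[apply_affine(p, pt) for pt in row] for row in combine_fields(fields, c)]


# Toy example on a 2x2 reference grid: S_0 is the identity field and S_1
# shifts every point by (2, 0); the affine map doubles the scale.
identity = [[(float(x), float(y)) for x in range(2)] for y in range(2)]
shifted = [[(x + 2.0, float(y)) for x in range(2)] for y in range(2)]
shape = model_shape([identity, shifted], c=[0.5, 0.5], p=[2, 0, 0, 0, 2, 0])
```

With the coefficients forced to sum to one, any overall change of scale has to be expressed through the affine parameters rather than through the shape coefficients.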

When used as a generative model, for given values
of $c$, $b$ and $p$, the model image is rendered
by computing $I^{model}$ in equation 3. For analysis
the goal is, given a novel image $I^{novel}$, to find
parameter values that generate a model image
as similar as possible to $I^{novel}$. The next section
describes how.


4  Matching  the  Model 

The analysis problem is the problem of matching
the flexible model to a novel image. The
general strategy is to define an error between
the novel image and the current guess for the
closest model image. We then try to minimize
this error with respect to the linear coefficients
$c$ and $b$ and the affine parameters $p$. Following
this strategy, we define the sum of squared
differences error

$E(c, b, p) = \frac{1}{2} \sum_{x,y} \left[ I^{novel}(x, y) - I^{model}(x, y) \right]^2$  (5)

where the sum is over all pixels $(x, y)$ in the images,
$I^{novel}$ is the novel gray level image being
matched and $I^{model}$ is the current guess for the
model gray level image. Equation 3 suggests
computing $I^{model}$ working in the coordinate system
of the reference image. To do this we simply
apply the shape transformation (given estimated
values for $c$ and $p$) to $I^{novel}$ and compare
it to the shape-free model, that is

$E(c, b, p) = \frac{1}{2} \sum_{x,y} \left[ I^{novel} \circ \left( A \circ \sum_{i=0}^{N} c_i S_i \right)(x, y) - \sum_{j=0}^{N} b_j T_j(x, y) \right]^2.$  (6)

Minimizing this error yields the model image
which best fits the novel image with respect to
the L2 norm. We use here the L2 norm but
other norms may also be appropriate (e.g. robust
statistics).
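
The error of equation 6 can be evaluated directly in the reference frame, as the following sketch illustrates. It is a simplified example rather than the authors' code: `ssd_error` and `sample_nearest` are hypothetical names, the `shape` argument is assumed to already hold the combined field $A \circ \sum_i c_i S_i$ for every reference pixel, and nearest-neighbour sampling stands in for whatever interpolation a real implementation would use.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]
Field = List[List[Tuple[float, float]]]


def sample_nearest(image: Image, x: float, y: float) -> float:
    """Nearest-neighbour lookup with border clamping (a deliberate simplification)."""
    h, w = len(image), len(image[0])
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return image[yi][xi]


def ssd_error(novel: Image, shape: Field, textures: Sequence[Image],
              b: Sequence[float]) -> float:
    """E = 1/2 * sum_{x,y} [ I_novel(shape(x, y)) - sum_j b_j T_j(x, y) ]^2."""
    total = 0.0
    for y, row in enumerate(shape):
        for x, (sx, sy) in enumerate(row):
            model_texture = sum(bj * t[y][x] for bj, t in zip(b, textures))
            residual = sample_nearest(novel, sx, sy) - model_texture
            total += residual * residual
    return 0.5 * total


# Toy check: the novel image equals texture T_0 shifted right by one pixel.
t0 = [[1.0, 2.0], [3.0, 4.0]]
t1 = [[0.0, 0.0], [0.0, 1.0]]
novel = [[9.0, 1.0, 2.0], [9.0, 3.0, 4.0]]
shift = [[(x + 1.0, float(y)) for x in range(2)] for y in range(2)]
perfect = ssd_error(novel, shift, [t0, t1], b=[1.0, 0.0])   # shape and texture both match
off = ssd_error(novel, shift, [t0, t1], b=[1.0, 1.0])       # one wrong texture coefficient
```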

In  order  to  minimize  the  error  function  any 
standard  minimization  algorithm  could  be  used. 
We  have  chosen  to  use  the  stochastic  gradient 
descent  algorithm  [Viola,  1995]  because  it  is  fast 
and  can  avoid  remaining  trapped  in  local  min¬ 
ima. 

The summation in equation 6 is over all pixels
in the model image. The idea of stochastic
gradient descent is to randomly sample a small
set of pixels from the image and only compute
the gradient at those pixels. This gives an estimate
for the true gradient [Viola, 1995]. In our
experiments we typically choose only 40 points
per iteration of the stochastic gradient descent.
This results in a large speedup over minimization
methods - such as conjugate gradient -
which compute the full gradient over the whole
image.
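
The sampling idea can be shown on a reduced problem. The sketch below fits only texture coefficients to a synthetic target, estimating the gradient from 40 randomly chosen pixels per iteration; it is a generic stochastic gradient descent written for illustration, not the algorithm of [Viola, 1995]. The function name `sgd_fit_texture`, the step size, the iteration count and the synthetic data are all assumptions of the example.

```python
import random
from typing import List, Sequence


def sgd_fit_texture(target: Sequence[float], textures: Sequence[Sequence[float]],
                    samples: int = 40, iterations: int = 2000,
                    rate: float = 0.5, seed: int = 0) -> List[float]:
    """Fit b in  target ~ sum_j b_j T_j  using gradients estimated from random pixels."""
    rng = random.Random(seed)
    b = [0.0] * len(textures)
    n = len(target)
    for _ in range(iterations):
        picks = [rng.randrange(n) for _ in range(samples)]
        grad = [0.0] * len(b)
        for k in picks:
            residual = target[k] - sum(bj * t[k] for bj, t in zip(b, textures))
            for j, t in enumerate(textures):
                # d/db_j of 1/2 * residual^2 is -residual * T_j at pixel k
                grad[j] -= residual * t[k]
        # Step against the sample-averaged gradient estimate.
        b = [bj - rate * g / samples for bj, g in zip(b, grad)]
    return b


# Synthetic data: two random "textures" of 500 pixels (values in [0, 1)) and a
# target that is exactly 0.3 * T_0 + 0.7 * T_1, so the true coefficients are known.
data_rng = random.Random(1)
t0 = [data_rng.random() for _ in range(500)]
t1 = [data_rng.random() for _ in range(500)]
target = [0.3 * a + 0.7 * c for a, c in zip(t0, t1)]
b_hat = sgd_fit_texture(target, [t0, t1])
```

Each iteration touches 40 pixels instead of all 500, which is the source of the speedup; in the full model the same sampling would apply to the derivatives with respect to the shape and affine parameters as well.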

Stochastic  gradient  descent  requires  the  deriva¬ 
tive  of  the  error  with  respect  to  each  parameter. 
These derivatives can be calculated straightforwardly
and are given in [Jones and Poggio,
1996]. 

Our matching algorithm uses a coarse-to-fine
approach to improve robustness ([Burt and
Adelson, 1983]; [Burt, 1984]) by creating a pyramid
of images with each level of the pyramid
containing an image that is one fourth the size
of the one below. The correspondence fields
must also be subsampled, and all x and y coordinates
must be divided by two at each level.
The optimization algorithm is first used to fit
the model parameters starting at the coarsest
level. The resulting parameter values are then
used as the starting point at the next level. The
translational affine parameters ($p_2$ and $p_5$) are
multiplied by 2 as they are passed down the
pyramid to account for the increased size of the
images. Section 5 shows a number of examples
to illustrate the performance of the matching
algorithm.
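
The bookkeeping of the coarse-to-fine scheme can be sketched as follows. This is an illustrative sketch only: plain 2x2 block averaging is used in place of the filtering of [Burt and Adelson, 1983], and the helper names (`downsample`, `downsample_field`, `build_pyramid`, `to_finer_level`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]
Field = List[List[Tuple[float, float]]]


def downsample(image: Image) -> Image:
    """Halve width and height by averaging 2x2 blocks (one fourth as many pixels)."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * y][2 * x] + image[2 * y][2 * x + 1] +
              image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def downsample_field(field: Field) -> Field:
    """Keep every other sample and divide the x and y coordinates by two."""
    return [[(fx / 2.0, fy / 2.0) for (fx, fy) in row[::2]] for row in field[::2]]


def build_pyramid(image: Image, levels: int) -> List[Image]:
    """Return the levels ordered [finest, ..., coarsest]."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def to_finer_level(p: Sequence[float]) -> List[float]:
    """Carry affine parameters down one level: only the translations p2, p5 double."""
    return [p[0], p[1], 2.0 * p[2], p[3], p[4], 2.0 * p[5]]


image = [[float(x + y) for x in range(8)] for y in range(8)]
pyramid = build_pyramid(image, levels=3)
sizes = [(len(level), len(level[0])) for level in pyramid]      # (height, width)
identity = [[(float(x), float(y)) for x in range(4)] for y in range(4)]
coarse_field = downsample_field(identity)
p_fine = to_finer_level([1.0, 0.0, 3.0, 0.0, 1.0, -1.5])
```

The linear part of the affine map ($p_0$, $p_1$, $p_3$, $p_4$) is unchanged between levels because it is dimensionless; only the translations are measured in pixels.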

Another useful technique implemented in our
model is principal components analysis. The
eigenvectors for both the shape space and texture
space are computed independently. The
shape space and texture space representations
can  then  be  effectively  “compressed”  by  us¬ 
ing  only  the  first  few  eigenvectors  (with  largest 
eigenvalues)  in  the  linear  combinations  of  the 
model  (both  for  shape  and  texture).  We  empha¬ 
size  that  the  technique  performs  well  without 
using  eigenvectors,  which  however  can  provide 
additional  computational  efficiency. 
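
To make the compression step concrete, the sketch below finds the first principal component of a small set of vectors by power iteration. It is an illustration only: the text does not say how the eigenvectors were computed, the function name `first_principal_component` is hypothetical, and a full implementation would extract several leading eigenvectors (for shape and texture separately) rather than just one.

```python
from typing import List, Sequence


def first_principal_component(vectors: Sequence[Sequence[float]],
                              iterations: int = 200) -> List[float]:
    """Leading eigenvector of the sample covariance, found by power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(d)]
    centered = [[v[k] - mean[k] for k in range(d)] for v in vectors]
    # Assumed starting guess; it must not be orthogonal to the leading component.
    direction = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance without forming it: C u = sum_v (v . u) v / n
        projections = [sum(a * u for a, u in zip(v, direction)) for v in centered]
        direction = [sum(p * v[k] for p, v in zip(projections, centered)) / n
                     for k in range(d)]
        norm = sum(x * x for x in direction) ** 0.5
        if norm == 0.0:
            raise ValueError("data has no variance along the starting direction")
        direction = [x / norm for x in direction]
    return direction


# Toy data: four 3-vectors that vary only along the direction (1, 2, 0).
vectors = [[0.0, 0.0, 1.0], [1.0, 2.0, 1.0], [2.0, 4.0, 1.0], [3.0, 6.0, 1.0]]
pc = first_principal_component(vectors)
```

Projecting each shape or texture vector onto a few such components replaces the $N + 1$ prototype coefficients with a shorter coefficient vector.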

5  Examples  and  Results 

The  model  described  in  the  previous  sections 
was  tested  on  two  different  classes  of  objects. 
The  first  was  the  class  of  frontal  views  of  hu¬ 
man  faces.  A  face  database  from  David  Beymer, 
formerly  of  the  MIT  AI  Lab  [Beymer,  1996], 
was  used  to  create  a  model  of  the  class  of  faces. 
(For a second face model see [Jones and Poggio,
1996].) The Beymer face database - consisting
of 62 faces - had been set into correspondence
by manually specifying a number of
corresponding  points  and  then  interpolating  to 
get  a  dense  correspondence  field.  The  second 
example  object  class  was  a  set  of  side  views 
of  cars  which  used  40  car  images.  The  corre¬ 
spondences  for  this  class  were  also  computed  by 
manually  specifying  a  number  of  corresponding 
points  and  then  interpolating. 

The  matching  algorithm  as  described  in  sec¬ 
tion  4  was  run  with  the  following  parameters. 
The  number  of  samples  randomly  chosen  by  the 
stochastic  gradient  descent  algorithm  per  itera¬ 
tion  was  40  and  it  was  run  for  8000  iterations 
per  pyramid  level.  Three  levels  were  used  in  the 
image  pyramids. 

The  running  time  of  the  matching  algorithm  for 
the  Beymer  faces  using  61  prototypes  (with  no 
compression)  was  about  9  minutes.  For  the  cars, 
the  running  time  was  about  5  minutes  using  all 
40  prototypes  (with  no  compression). 

5.1  Faces 

The  face  model  was  built  from  the  Beymer  face 
database,  shown  in  figure  1.  These  images  were 
226  pixels  high  by  184  pixels  wide.  The  face 
in the upper left corner was used as the reference
face. The correspondences from the reference
face to each of the other faces were given
by manually specifying a small number of corresponding
points. These faces are difficult to put
in  correspondence  due  to  the  hair,  beards  and 
mustaches.  The  results  of  testing  the  model’s 
ability  to  match  novel  faces  are  shown  in  figure 
2.  Because  we  only  had  62  faces  to  work  with, 
61  were  used  in  the  model  and  one  was  left  out 
so  that  it  could  be  used  as  a  novel  image.  Since 
the  hair  is  not  modelled  well,  the  faces  in  figure 
2  have  been  cropped  appropriately.  One  can 
see  that  the  matching  algorithm  produced  good 
matches  to  the  novel  faces. 

5.2  Cars 

As another example of a linear object class
model, we chose side views of cars. The car images
are 96 pixels high by 256 pixels wide. Forty
examples of side views of cars were used to build
the model. The example images were similar
to the novel car images shown in figure 3. The
model defined by these prototypes and their correspondences
was tested on its ability to match
a number of novel cars. Figure 3 shows some
example matches to novel cars. The novel cars
are matched reasonably well, although not as
sharply as the faces. This is probably due to
the fact that the correspondences given for the
car prototypes are not as accurate as in the case
for faces. The main reason for this is the variability
of appearance between cars: some cars
have features that do not appear on other cars
(such as spoilers). Defining correspondences for
such features is ambiguous.

Figure 1: Database of 62 prototype faces used to create the model of human faces.

Figure 2: Four examples of matching the face
model to a novel face image (panels: input image;
output of matching algorithm).

Figure 3: Three examples of matching the car
model to a novel car image, which was not
among the model prototypes (panels: input image;
output of matching algorithm).

6 Conclusions

We have described a flexible model to
represent images of a class of objects and in
particular how to use it to analyze new images
and represent them in terms of the model. Our
flexible model does not need to be handcrafted
but can be directly “learned” from a small set of
images of prototypical objects. The key idea underlying
the flexible model is a representation of
images that relies on the computation of dense
correspondence between images. In this representation,
the set of images is endowed with the
algebraic structure of a linear vector space.

The main contribution of this paper is to solve
the analysis problem: how to apply the flexible
model, so far used as a generative model,
for  image  analysis.  Key  to  the  analysis  step  is 
matching  and  a  new  matching  algorithm  is  the 
main  focus  of  this  paper.  The  matching  algo¬ 
rithm  has  been  shown  to  be  robust  to  changes 
in  position,  rotation  and  scale  of  the  novel  input 
image.  It  can  also  match  partially  occluded  in¬ 
put  images  [Jones  and  Poggio,  1996].  We  have 
also described in more detail than in any previous
paper the model itself and how to obtain it
and learn it from prototype images through pixelwise
correspondence. Analysis coupled with
synthesis offers a number of significant new applications
of the flexible model, including recognition,
image compression, correspondence and
learning of visual tasks in a top-down way, specific
to object classes (e.g. estimation of contours,
shape and color). These applications are
discussed in more detail in [Jones and Poggio,
1996].

7  Acknowledgements 

The authors would like to thank Thomas Vetter,
Amnon Shashua, Federico Girosi, Paul Viola
and Tony Ezzat for helpful comments and
discussions  about  this  work.  Michael  Oren  sug¬ 
gested  the  notation  of  section  3.1  in  addition  to 
many  other  very  useful  comments. 

References 
[Besl  and  Jain,  1985] 

Paul  J.  Besl  and  Ramesh  C.  Jain.  Three- 
dimensional  object  recognition.  Computing 
Surveys,  17(1):75-145,  1985. 

[Beymer  and  Poggio,  1995]  David  Beymer  and 
Tomaso  Poggio.  Face  recognition  from  one 
example  view.  A.I.  Memo  1536,  MIT,  1995. 

[Beymer  and  Poggio,  1996]  David  Beymer  and 
Tomaso  Poggio.  Image  representations  for  vi¬ 
sual  learning.  Science,  272:1905-1909,  June 
1996. 

[Beymer  et  al.,  1993]  D.  Beymer,  A.  Shashua, 
and  T.  Poggio.  Example  based  image  analysis 
and  synthesis.  A.I.  Memo  1431,  MIT,  1993. 

[Beymer, 1996] David Beymer. Pose-Invariant
Face Recognition Using Real and Virtual
Views. PhD thesis, Massachusetts Institute
of Technology, 1996.

[Bülthoff et al., 1995] H. H. Bülthoff, S. Y.
Edelman, and M. J. Tarr. How are three-dimensional
objects represented in the brain?
Cerebral Cortex, 5(3):247-260, 1995.

[Burt  and  Adelson,  1983]  Peter  J.  Burt  and 
Edward  J.  Adelson.  The  laplacian  pyramid  as 
a  compact  image  code.  IEEE  Transactions  on 
Communications,  COM-31(4):532-540, 1983. 

[Burt, 1984] Peter J. Burt. The pyramid as a
structure for efficient computation. In Multi-Resolution
Image Processing and Analysis,
pages 6-37. Springer-Verlag, 1984.

[Choi et al., 1991] Chang Seok Choi, Toru
Okazaki, Hiroshi Harashima, and Tsuyoshi
Takebe. A system of analyzing and synthesizing
facial images. IEEE, pages 2665-2668,
1991.

[Cootes and Taylor, 1992] T.F. Cootes and
C.J. Taylor. Active shape models - 'smart
snakes'. British Machine Vision Conference,
pages 266-275, 1992.

[Cootes  and  Taylor,  1994] 

T.F.  Cootes  and  C.J.  Taylor.  Using  grey- 
level  models  to  improve  active  shape  model 
search.  International  Conference  on  Pattern 
Recognition,  pages  63-67,  1994. 

[Cootes et al., 1992] T.F. Cootes, C.J. Taylor,
D.H. Cooper, and J. Graham. Training
models of shape from sets of examples.
British Machine Vision Conference, pages 9-18,
1992.

[Cootes  et  al,  1993]  T.F.  Cootes,  C.J.  Taylor, 
A.  Lanitis,  D.H.  Cooper,  and  J.  Graham. 
Building  and  using  flexible  models  incorpo¬ 
rating  grey-level  information.  In  ICCV,  pages 
242-246,  Berlin,  May  1993. 

[Edelman and Bülthoff, 1990] Shimon
Edelman and Heinrich Bülthoff. Viewpoint-specific
representations in three dimensional
object recognition. A.I. Memo 1239, MIT,
1990.


[Hallinan, 1995] Peter Winthrop Hallinan. A
Deformable Model for the Recognition of Human
Faces Under Arbitrary Illumination.
PhD thesis, Harvard University, 1995.

[Hill  et  al,  1992]  A.  Hill,  T.F.  Cootes,  and  C.J. 
Taylor.  A  generic  system  for  image  inter¬ 
pretation  using  flexible  templates.  British 
Machine  Vision  Conference,  pages  276-285, 
1992. 

[Jones  and  Poggio,  1995]  Michael  Jones  and 
Tomaso  Poggio.  Model-based  matching  of 
line  drawings  by  linear  combinations  of  pro¬ 
totypes.  In  Proceedings  of  the  Fifth  Interna¬ 
tional  Conference  on  Computer  Vision,  pages 
531-536, 1995. 

[Jones  and  Poggio,  1996]  Michael  Jones  and 
Tomaso  Poggio.  Model-based  matching  by 
linear  combinations  of  prototypes.  A.I.  Memo 
1583,  MIT,  1996. 

[Kirby  and  Sirovich,  1990] 

M.  Kirby  and  L.  Sirovich.  The  application  of 
the  karhunen-loeve  procedure  for  the  charac¬ 
terization  of  human  faces.  IEEE,  12(1):103- 
108,  January  1990. 

[Lanitis  et  al,  1995]  A.  Lanitis,  C.J.  Taylor, 
and  T.F.  Cootes.  A  unified  approach  to  cod¬ 
ing  and  interpreting  face  images.  In  ICCV, 
pages 368-373, Cambridge, MA, June 1995.

[Pauls et al., 1996] J. Pauls, E. Bricolo, and
N.K. Logothetis. Physiological evidence for
viewer centered representation in the monkey.
In S. Nayar and T. Poggio, editors, Early Visual
Learning. Oxford University Press, 1996.

[Poggio and Beymer, 1996] Tomaso Poggio and
David Beymer. Learning to see. IEEE Spectrum,
pages 60-69, 1996.

[Poggio and Brunelli, 1992] Tomaso Poggio
and Roberto Brunelli. A novel approach to
graphics. A.I. Memo 1354, MIT, 1992.

[Poggio  and  Vetter,  1992]  Tomaso  Poggio  and 
Thomas  Vetter.  Recognition  and  structure 
from  one  2d  model  view:  Observations  on 
prototypes,  object  classes  and  symmetries. 
A.I.  Memo  1347,  MIT,  1992. 




[Poggio,  1990]  Tomaso  Poggio.  A  theory  of  how 
the  brain  might  work.  A.I.  Memo  1253,  MIT, 
1990. 

[Shashua,  1992]  Amnon  Shashua.  Projective 
structure from two uncalibrated images:
Structure  from  motion  and  recognition.  A.I. 
Memo  1363,  MIT,  1992. 

[Sinha, 1995] P. Sinha. Perceiving and recognizing
3D forms. PhD thesis, Massachusetts
Institute of Technology, 1995.

[Turk and Pentland, 1991] M.A. Turk and A.P.
Pentland.  Face  recognition  using  eigenfaces. 
In  IEEE  Conference  on  Computer  Vision  and 
Pattern  Recognition,  pages  586-591, 1991. 


[Ullman and Basri, 1991]

S. Ullman and R. Basri. Recognition by linear
combinations of models. IEEE Transactions
on Pattern Analysis and Machine Intelligence,
13:992-1006, 1991.

[Vetter  and  Poggio,  1995]  Thomas  Vetter  and 
Tomaso  Poggio.  Linear  object  classes  and  im¬ 
age  synthesis  from  a  single  example  image. 
A.I.  Memo  1531,  MIT,  1995. 

[Vetter  et  al.,  1996]  Thomas  Vetter,  Michael 
Jones,  and  Tomaso  Poggio.  A  bootstrapping 
algorithm  for  learning  linearized  models  of 
object  classes,  submitted,  1996. 

[Viola,  1995]  Paul  Viola.  Alignment  by  max¬ 
imization  of  mutual  information.  MIT  A.I. 
Technical  Report  1548,  MIT,  1995. 




Orientation Behavior Using Registered Topographic Maps

C.  Ferrell 

Massachusetts  Institute  of  Technology  Cambridge  MA 


Abstract 

The  ability  to  orient  toward  visual,  auditory,  or  tac¬ 
tile  stimuli  is  an  important  skill  for  systems  intended 
to  interact  with  and  explore  their  environment.  In 
the  brain  of  mammalian  vertebrates,  the  Superior 
Colliculus  is  specialized  for  integrating  multi-modal 
sensory  information,  and  for  using  this  information 
to orient the animal to the source of sensory stimuli,
such  as  noisy,  moving  objects.  Within  the  Superior 
Colliculus,  this  ability  appears  to  be  implemented 
using  layers  of  registered,  multi-modal,  topographic 
maps.  Inspired  by  the  structure,  function,  and  plas¬ 
ticity  of  the  Superior  Colliculus,  we  are  in  the  pro¬ 
cess  of  implementing  multi-modal  orientation  behav¬ 
iors  on  our  humanoid  robot  using  registered  topo¬ 
graphic  maps. 

1  Introduction 

The  ability  to  orient  to  sensory  stimuli  is  an  im¬ 
portant  skill  for  autonomous  agents  that  operate  in 
complex, dynamic environments. In animals, orientation
behavior serves to direct the animal’s
eyes,  ears,  nose,  and  other  sensory  organs  to  the 
source  of  sensory  stimulation.  By  doing  so,  the  an¬ 
imal  is  poised  to  assess  and  explore  the  nature  of 
the  stimulus  with  complementary  sensory  systems, 
which  in  turn  affects  and  guides  ensuing  behavior. 
Hence,  orientation  behavior  is  performed  frequently 
and  repeatedly  by  agents  that  are  tightly  coupled 
with  their  environment,  where  perception  guides  ac¬ 
tion  and  behavior  assists  in  more  effective  percep¬ 
tion. 

Our  approach  to  implementing  orientation  behav¬ 
ior  on  Cog,  our  humanoid  robot  [4],  is  heavily  in¬ 
spired  by  relevant  work  in  neuroscience  [10],  [2].  In 
the  brain  of  mammalian  vertebrates,  the  Superior 
Colliculus is an organ specialized for producing orientation
behavior. In non-mammalian vertebrates
(birds, amphibians, etc.), the optic tectum is the
analogous organ. The structure of the Superior Colliculus
is characterized by layers of topographically
organized maps. Collectively, they represent the sensorimotor
space of the animal in ego-centered coordinates.
These maps are interconnected and interact
in such a way that the animal performs orientation
movements in response to sensory stimuli.

Topographically  organized  maps  have  been  discov¬ 
ered  throughout  the  brain  of  mammalian  verte¬ 
brates.  In  addition  to  the  Superior  Colliculus,  they 
have  been  identified  in  various  perceptual  areas  of 
the neocortex (the visual, auditory and somatosensory
cortices, for instance). It is widely recognized
that the organization of these maps is plastic and
can be shaped through experience. Consequently,
cortical maps have garnered a lot of attention, and
a variety of work has explored the phenomenon of
self-organizing feature maps [6], [8], [7].

The  rest  of  this  paper  is  organized  as  follows.  First 
we  will  briefly  cover  the  organization,  structure,  and 
function  of  the  Superior  Colliculus,  as  our  imple¬ 
mentation  is  strongly  inspired  by  what  is  understood 
about  this  organ.  Next  we  present  the  state  of  our 
implementation  at  the  time  this  paper  was  written, 
as  well  as  extensions  currently  under  development. 
Finally,  we  present  tests  and  results  of  our  system 
to  date,  and  conclude  with  a  brief  description  of  on¬ 
going  work  and  future  directions. 


2  The  Superior  Colliculus 


The  Superior  Colliculus  is  a  midbrain  structure  com¬ 
posed  of  seven  laminar  layers.  The  deep  layers  are 
those  believed  to  play  a  role  in  orientation  behav¬ 
ior.  An  important  function  of  the  Superior  Collicu¬ 
lus  is  to  pool  sensory  inputs  from  different  modalities 
and  redirect  the  corresponding  sensory  organs  (eyes, 
ears, nose) to fixate on the source of the signal.

2.1  Organization  of  the  Superior  Colliculus

Localized  regions  of  the  Superior  Colliculus  consist 
of  neurons  with  receptive  fields  that  form  topologi¬ 
cally  organized  maps.  Each  map  corresponds  to  ei¬ 
ther  a  single  modality  or  a  combination  of  modal¬ 
ities.  In  the  cat,  there  are  visuotopic  maps  repre¬ 
senting  motion  in  visual  space,  somatotopic  maps 
yielding  a  body  representation  of  tactile  inputs,  and 
spatiotopic  maps  of  auditory  space  encoding  inter- 
aural  time  differences  (ITD)  and  inter-aural  inten¬ 
sity  differences.  Hence,  a  sensory  stimulus  origi¬ 
nating  from  a  given  direction  will  elicit  activity  in 
the  corresponding  region  of  the  appropriate  sensory 
map.  There  are  also  motor  movement  maps  consist¬ 
ing  of  pre-motor  neurons  whose  movement  fields  are 
topologically  organized.  In  the  cat,  these  exist  for 
the eyes, head, neck, body, and ears.

2.2  The  Role  of  Map  Registration 

These  multi-modal  maps  overlap  and  are  aligned 
with  each  other  so  that  they  share  a  common  mul- 
tisensory  spatial  coordinate  system.  The  maps  are 
said  to  be  registered  with  one  another  when  this  is 
the  case.  Arranging  multi-modal  information  into 
a  common  representational  framework  within  each 
map  and  aligning  them  allows  the  information  to  in¬ 
teract  and  influence  each  other.  There  are  several 
advantages  to  this  organizational  strategy.  First,  it 
is an economical way of specifying the location of
peripheral stimuli, and of organizing and activating
the motor program required to orient towards
them, thereby allowing any sensory modality to orient
the other sensory organs to the source of stimulation.
Second, it supports enhancement of simultaneous
sensory cues. Stimuli that occur in the same
place at the same time are likely to be interrelated
by common causality. For instance, a bird rustling in
the bushes will provide both visual motion and auditory
cues. During enhancement, certain combinations
of meaningful stimuli become more salient because
their neuronal responses are spatio-temporally
related. Once the multi-modal maps are aligned,
neuronal  enhancement  (or  depression)  is  a  function 
of  the  temporal  and  spatial  relationships  of  neural 
activity  among  the  maps. 

2.3  Development  and  Experience  De¬ 
pendent  Plasticity 

During development, the organization of the topographic
maps as well as the registration between different
maps is plastic. For each map, its representation
of space is use dependent. A number of people
have modeled the phenomenon of self-organizing feature
maps (SOFMs) using neural networks [6], [1],
[5]. Plasticity of map registration has been studied
in the inferior and superior colliculus of young
barnyard owls, where the registration of the auditory
map to the visual map shifts according to experience
[2], [3].

3  A  Developmental  Approach  to  Orientation  Behavior

Registered  topographic  maps  form  a  substrate  upon 
which  multi-modal  information  can  be  integrated  to 
produce  coherent  behavior.  How  are  these  topo¬ 
graphic  maps  formed?  How  do  they  become  reg¬ 
istered  with  one  another?  How  is  the  organization 
of  the  ensemble  guided  by  experience? 

3.1  The  Framework 

In  our  framework,  a  map  is  a  two  dimensional  ar¬ 
ray  of  elements  where  each  element  corresponds  to 
a  site  in  the  map.  The  maps  are  arranged  into  in¬ 
terconnected  layers,  where  a  given  map  can  be  in¬ 
terfaced  to  more  than  one  map.  Each  connection 
is  uni-directional,  so  recurrent  connections  between 
maps  require  both  a  feedforward  connection  and  a 
feedback  connection.  The  activity  level  of  sites  on 
one map is passed to another map through these
connections,  hence  the  input  to  a  given  map  is  a 
function  of  the  spatio-temporal  activity  of  the  maps 
feeding  into  it  and  the  connectivity  between  these 
maps.  Currently,  all  connections  have  equal  weights, 
although  this  could  change  in  the  future.  The  out¬ 
put  of  a  given  map  is  its  spatio-temporal  activity 
pattern.  What  this  pattern  of  activity  represents 
depends  upon  the  map:  if  it  is  a  visuotopic  map, 
it  could  represent  motion  coming  from  a  particular 
direction  in  the  visual  field;  if  it  is  an  oculomotor 
map,  it  could  encode  a  motor  command  to  move  the 
eyes,  and  so  forth. 
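
The framework described above can be sketched with plain two dimensional arrays and a connection table. This is a minimal sketch under stated assumptions, not the code running on Cog: the `propagate` function, the dictionary-based connection table and the summing of incoming activity are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

Site = Tuple[int, int]              # (row, column) of a map element
Activity = List[List[float]]        # activity[row][column]


def propagate(source: Activity, connections: Dict[Site, List[Site]],
              target_shape: Tuple[int, int]) -> Activity:
    """Pass activity along uni-directional, equally weighted connections."""
    rows, cols = target_shape
    target = [[0.0] * cols for _ in range(rows)]
    for (r, c), destinations in connections.items():
        for (tr, tc) in destinations:
            # Equal weights: each connection forwards the source activity unchanged.
            target[tr][tc] += source[r][c]
    return target


# Toy ensemble: a 2x2 sensory map feeding a 2x2 motor map, site to site.
# A recurrent link would need a second, separate table in the other direction.
feedforward = {(r, c): [(r, c)] for r in range(2) for c in range(2)}
sensory = [[0.0, 0.0],
           [0.0, 1.0]]
motor = propagate(sensory, feedforward, target_shape=(2, 2))
```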

The  smallest  map  ensemble  capable  of  producing  an 
observable  behavior  consists  of  a  sensory  input  map, 
a  motor  output  map,  and  an  established  set  of  con¬ 
nections  between  them.  The  input  map  could  have 
a  fairly  rigid  structure  consisting  simply  of  time- 
differenced  intensity  images.  Because  visual  infor¬ 
mation  already  contains  a  spatial  component,  this 
simple  map  is  topographic  without  any  additional 
tuning.  The  motor  map  could  also  be  fixed  where 
a  given  site  on  the  map  corresponds  to  a  given  mo¬ 
tor  command.  If  the  motor  commands  vary  linearly 
with motor space, for instance, this map is also topographically
organized. Assuming the cameras are
motionless, a moving object occupies a localized region
in the visual field, and correspondingly causes
a localized intensity difference (an active region) in
the time-differenced image map. If there exist connections
from this region of the time-difference map
to the appropriate region of the oculomotor map,
then a motion stimulus in the visual field activates
the corresponding region of the time-difference map,
which in turn excites the connected region of the
oculomotor map, which evokes the necessary camera
motion to foveate the stimulus.
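
The smallest ensemble described above, a time-differenced image map driving a fixed motor map, can be sketched as follows. The sketch is illustrative only: the threshold, the use of the activity centroid to pick a site, and the linear pixel-to-degrees mapping in `motor_command` are assumptions of the example, as are all function names.

```python
from typing import List, Optional, Tuple

Image = List[List[float]]


def time_difference(previous: Image, current: Image, threshold: float) -> Image:
    """Absolute frame difference; values at or below `threshold` are set to 0."""
    return [[abs(c - p) if abs(c - p) > threshold else 0.0
             for p, c in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(previous, current)]


def activity_centroid(activity: Image) -> Optional[Tuple[float, float]]:
    """Activity-weighted mean (x, y) of the map, or None if nothing is active."""
    total = sum(sum(row) for row in activity)
    if total == 0:
        return None
    cx = sum(x * v for row in activity for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(activity) for v in row) / total
    return (cx, cy)


def motor_command(site: Tuple[float, float], width: int, height: int,
                  degrees_per_pixel: float) -> Tuple[float, float]:
    """Hypothetical linear motor map: (pan, tilt) offset from the image centre."""
    x, y = site
    return ((x - (width - 1) / 2.0) * degrees_per_pixel,
            (y - (height - 1) / 2.0) * degrees_per_pixel)


previous = [[0.0] * 5 for _ in range(3)]
current = [row[:] for row in previous]
current[1][4] = 100.0                       # something moved at the right edge
active = time_difference(previous, current, threshold=10.0)
target = activity_centroid(active)
command = motor_command(target, width=5, height=3, degrees_per_pixel=2.0)
```

Because both maps are fixed and topographic here, the mapping from active site to motor command is a simple linear lookup; the plasticity discussed next is what would replace that hand-set mapping.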

3.2  Developmental  Mechanisms 

Plasticity  can  be  introduced  into  the  simple  system 
above  in  two  ways:  1)  the  map  organization  could 
change  so  that  a  given  map  site  could  correspond 
to  different  locations  in  space.  2)  The  connections 
between  maps  could  change  so  that  a  given  site  could 
change  which  site(s)  it  connects  to  on  the  other  map. 

In  animals,  as  described  in  section  2.3,  the  orga¬ 
nization  of  the  maps  and  the  registration  between 
maps  is  tuned  during  the  critical  period  of  develop¬ 
ment.  Several  mechanisms  and  models  have  been 
proposed  to  account  for  this  organizational  process. 
The  mechanisms  we  use  for  map  organization  and 
alignment  on  Cog  are  inspired  by  similar  mecha¬ 
nisms  [6].  However,  different  combinations  of  mecha¬ 
nisms  are  used  depending  on  what  is  being  learned: 
i.e.  tuning  the  organization  within  a  map,  regis¬ 
tering  different  sensory  maps,  or  registering  sensory 
maps  and  motor  maps. 

A  variety  of  mechanisms  determine  how  map  connec¬ 
tions  are  established.  Guided  by  sensori-motor  ex¬ 
perience,  these  mechanisms  govern  how  connections 
are  modified  to  improve  behavioral  performance. 

•  Competition:  There  is  competition  between 
concurrently  active  sites  where  only  the  most 
active site is modified per trial. In our system,
the most active site is currently approximated
as the centroid of activity of the active region.
Furthermore,  each  site  of  a  given  map  can  only 
form  a  limited  number  of  connections  to  the 
other  map.  So,  candidate  map  sites  compete 
to  determine  those  that  can  connect  to  a  given 
site  on  the  other  map. 

•  Locality,  neighborhood  influences:  The  neigh¬ 
boring  sites  around  the  most  active  site  are  also 
updated  each  trial.  The  amount  a  neighbor¬ 
ing  site  is  adjusted  decays  with  distance  from 
the maximally active site. This mechanism penalizes
long connections and encourages topographic
organization. The size of the neighborhood
can vary over time. Typically, it starts off
fairly large until the map displays some rough
topographic organization, then it decreases as
the map undergoes fine-tuning adjustments.

•  Error  correction:  It  is  not  sufficient  that 
the  maps  are  topographically  organized  and 
aligned - they must be organized and interfaced
so that the agent performs well in its environment.
For tight-feedback-loop sensorimotor
tasks (such as saccading to a visual stimulus),
an appropriate error signal is very important
and useful for tuning the behavior of
the  system.  Naturally,  the  error  signal  must  be 
a  good  measure  of  performance  and  obtainable 
at  a  fast  enough  rate  to  enable  on-line  learning. 
Connections  are  modified  to  reduce  the  discrep¬ 
ancy  between  current  performance  and  desired 
performance.  The  magnitude  of  the  correction 
is  proportional  to  the  size  of  the  error  on  that 
trial. 

•  Correlated  temporal  activity:  Hebbian  mecha¬ 
nisms  are  often  used  for  self-organizing  pro¬ 
cesses.  By  strengthening  connections  between 
simultaneously  active  sites,  they  are  useful  for 
relating  information  between  different  sensory 
maps. 

•  Learning  rate:  The  magnitude  of  the  adjust¬ 
ment  for  each  trial  is  also  proportional  to  the 
learning rate. The learning rate can vary over
time: it starts off relatively large for
coarse tuning, and then decreases for finer
adjustments.

3.3  A  Simple  Behavior 

In  this  section  we  look  at  an  example  to  see  how 
these  mechanisms  are  applied  to  forming  and  orga¬ 
nizing  these  multiple  maps  to  perform  a  task.  A 
simple  orientation  task  is  the  ability  to  saccade  to 
noisy,  moving  stimuli  (clapping  hands,  shaking  a  rat¬ 
tle,  etc.).  We  say  that  a  good  saccade  centers  the 
stimulus  in  the  fovea  camera’s  field  of  view,  whether 
the  stimulus  is  seen  in  the  wide  field  of  view  or  the 
fovea  field  of  view.  We  assume  that  the  system  fa¬ 
vors  information  from  the  foveal  view  because  it  is 
of  higher  resolution  than  the  peripheral  view  and 
thereby  can  be  used  to  perform  a  more  accurate  sac¬ 
cade. 

Experience  dependent  plasticity  could  play  a  role  in 
several  ways.  It  could  be  used  to  guide  the  repre¬ 
sentational  organization  of  the  auditory  and  visual 
motion maps, guide the registration between the
auditory and visual motion maps, or guide the
registration of the sensory maps to the oculomotor map.
In this paper, we concentrate on the registration of
sensory maps to motor maps.

Each mapping process can be viewed as learning a
multi-modal  map  that  registers  the  information  from 
two  other  modality  maps.  We  call  the  multi-modal 
map  the  registration  map,  and  the  other  maps  could 
be  either  sensory  maps,  motor  maps,  or  both.  One  of 
the  modality  maps  provides  the  rough  spatial  organi¬ 
zation  of  the  multi-modal  map.  We  call  this  map  the 
receptive  field  map.  Typically  it  has  a  topographic 
representation  of  space.  Often  a  retinotopic  map  is 
used,  for  instance.  The  second  modality  map,  which 
may  or  may  not  be  topographic,  is  registered  to  the 
first  map  through  the  multi-modal  map.  Note  that 
each  site  of  a  modality  map  connects  to  only  one 
site  on  the  registration  map,  but  the  same  site  on 
the registration map could connect to multiple sites
on  the  modality  maps. 

3.3.1  Registration  of  Sensory-Motor  Maps 

An  example  of  aligning  sensor  and  motor  maps  is 
registering  the  oculomotor  map  with  the  visual  mo¬ 
tion  map.  In  this  case,  the  receptive  field  map  is 
the  visual  motion  map  (in  retinotopic  coordinates), 
the  registration  map  is  a  visuo-motor  map  (also  in 
retinotopic  coordinates),  and  the  third  map  is  the 
oculomotor  map  (in  eye  motor  coordinates).  Re¬ 
gions  in  the  motor  map  correspond  to  motor  move¬ 
ments  that  could  foveate  a  stimulus.  Initially  the 
visual  map  and  the  oculomotor  map  are  connected 
to  the  registration  map  with  broad,  overlapping  re¬ 
ceptive  fields.  When  the  motion  map  is  stimulated 
and  the  site  of  maximal  activity  is  determined  (typi¬ 
cally  the  centroid  of  the  stimulated  region),  the  cor¬ 
responding  region  of  the  oculomotor  map  is  stimu¬ 
lated.  The  site  of  maximal  response  of  the  motor 
map  is  taken  as  the  motor  command,  and  the  cor¬ 
responding  motor  movement  is  evoked.  This  move¬ 
ment  orients  the  eye  to  the  stimulus.  Once  oriented, 
the  motion  stimulus  stimulates  a  different  region  in 
the visual motion map. The visual error is computed
as the difference between the centroid of motion and
the center of the field of view. This error is used
to  update  the  connections  responsible  for  the  orien¬ 
tation  movement  to  reduce  the  error  in  the  future. 
Hence,  the  primary  developmental  mechanisms  are 
competition,  neighborhood  updates,  and  error  cor¬ 
rection. 

4  Architectural  Organization 

To date, the sensory-motor map registration task
has been implemented on Cog’s hardware and is
shown in figure 1. The diagram shows how the processes
are arranged on Cog’s MIMD computer.

Figure 1: This diagram shows how the multi-modal topographic
maps are arranged on Cog’s computational
hardware. Currently five processors are used: two visual
processors, two motor control processors, and one processor
which performs the developmental mechanisms.
See text for further explanation.

Currently, five processing nodes are used:

• Peripheral motion processor: Contains the peripheral
visual motion map. It computes the
difference  between  consecutive  left  peripheral 
camera  images  at  15  frames/s.  It  also  deter¬ 
mines  the  most  active  site  (the  centroid  of  mo¬ 
tion)  and  a  visual  error  signal. 

•  Fovea  motion  processor:  Contains  the  fovea  vi¬ 
sual  motion  map.  It  computes  the  difference 
between  consecutive  left  fovea  camera  images 
at  15  frames/s.  It  also  determines  the  most  ac¬ 
tive  site  (the  centroid  of  motion)  and  a  visual 
error  signal. 

•  Registration  processor:  Contains  the  visuo- 
motor  map  and  carries  out  the  developmental 
process. It receives motion information from
the vision processors and determines which motion
information to use. If fovea motion is
present,  it  ignores  the  information  from  the  pe¬ 
ripheral  camera.  It  also  translates  the  most 
active  site  on  the  visual  map  to  the  region  of 
activity  on  the  motor  map,  and  passes  this  in¬ 
formation  to  the  oculomotor  processor.  After 
the  motion  is  performed,  it  uses  the  error  sig¬ 
nal  from  the  vision  processors  to  update  the 
registration  map  connections  according  to  de¬ 
velopmental  mechanisms. 

• Oculomotor processor: Contains the oculomotor
map. Upon receiving the site of activity,
it commands the motors to perform the
movement. It also sends an “efferent copy” to
the registration processor, so the registration
processor can ignore visual motion information
while the cameras are moving.

• Neckmotor processor: Contains the neckmotor
map. It is commanded by the registration processor
to move the neck around so the motion
stimulus is seen from many different places in
the visual field. Currently, the neck is primarily
used for the training processes. However,
soon it will be incorporated into the orientation
behavior.

[Plot: Saccade Error vs. Trials]

Figure 2: Registration data for aligning the visual motion
map and the oculomotor map. The data is derived
from the registration map, converting sites in visuotopic
coordinates to activated sites in the motor movement
map (pan and tilt) required to foveate the stimulus. Each
trial, the neighborhood size is set equal to the larger
value of the average error measures (pan or tilt). Initially
the mapping is random, with a neighborhood size
of 26 and a learning rate of .25. By trial = 100, the
average error is reduced to (3°, 2°), and by trial = 400
it is reduced to (1°, 1°). After training, it converges to
an average error < (1°, 1°).

5  Tests 

To  date  we  have  run  experiments  to  test  whether  the 
implementation  we  have  described  learns  the  regis¬ 
tration  between  the  retinotopic  visual  motion  map 
and the oculomotor map. A sampling of our results
is shown in figure 2.

So  far,  experiments  have  been  performed  using  the 
left  eye  only.  The  system  will  be  extended  to  han¬ 
dle  both  eyes  when  stereopsis  and  vergence  capa¬ 
bilities  are  implemented.  Motion  information  from 
both  eyes  will  be  fused  and  used  to  excite  the  visuo¬ 
topic  motion  map.  Conflicts  between  the  eyes  will 
be  resolved  during  this  fusion  stage.  The  simplest 
approach would be to resolve conflicts via a dominant
eye mechanism. Another method could involve
exciting the visuotopic map with the stronger of the
two excitations, one coming from each eye. These two
methods  along  with  other  possibilities  need  to  be  ex¬ 
plored.  Most  likely,  a  combination  of  methods  will 
be  implemented. 

To learn the registration between the peripheral motion
map and the oculomotor map, we trained Cog
over  a  number  of  trials  while  it  looked  at  a  contin¬ 
uously  moving  stimulus.  At  the  beginning  of  each 
trial,  the  robot  changes  its  posture  (centers  its  eyes 
and  moves  its  neck  to  a  random  location).  This 
places  the  motion  stimulus  in  a  different  location 
in  its  visual  field.  Currently  Cog  explores  the  center 
20° × 20° of the peripheral visual field, which corresponds
to a 20 × 20 region of the registration map.
The  robot  uses  the  visual  information  to  stimulate 
the  oculomotor  map  and  perform  the  saccade.  The 
visual  error  is  then  acquired,  and  the  registration 
between  the  maps  is  updated  according  to  the  rule: 

$\Delta m(x,y) = p \times e(x,y) \times N(x,y)$  (1)

where: 

• $m(x,y)$ is the value of site $(x,y)$ of registration
map $m$. Recall that this value represents the
connection from the visual motion map site to
the corresponding oculomotor map site. The
learning process involves updating these inter-map
connections.

• $(x^*,y^*)$ is the site of maximal activity of the
motion map. For this application, it corresponds
to the site of maximal activity of the
registration map as well.

• $p$ is the learning rate.

• $e(x,y) = \mathrm{target}(x,y) - m(x,y)$. It is an error
distance measure between the motion map site
$m(x,y)$ and the target site. This measurement
is made after the saccade motion finishes. Note
that $\mathrm{target}(x,y)$ is the center of the field of view
for the saccade learning task.

• $r$ is the neighborhood radius.

• $N(x,y) = f\left(1 - \frac{\lVert (x^*,y^*) - (x,y) \rVert}{r}\right)$. It is the
neighborhood update function that decays linearly
with distance from the site of maximal activity
$(x^*,y^*)$. The threshold function $f$ sets the result
equal to zero if its argument is negative. So,
for site locations $(x,y)$ outside radius $r$ of $(x^*,y^*)$,
$N(x,y) = 0$.
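The update rule (1) can be sketched in a few lines. This is an illustrative reading under stated assumptions, not the implementation on Cog: each registration-map site is assumed to store a (pan, tilt) pair, the distance in the neighborhood function is assumed to be Euclidean, and the single error measured after the saccade is assumed to be applied to every site in the neighborhood. All function and parameter names are hypothetical.

```python
import math
from typing import List, Tuple

Vec2 = Tuple[float, float]  # (pan, tilt) value stored at a map site (assumed)


def neighborhood(x: int, y: int, x_star: int, y_star: int, r: float) -> float:
    """N(x, y): decays linearly with distance from (x*, y*), zero beyond radius r."""
    dist = math.hypot(x - x_star, y - y_star)  # Euclidean distance (assumption)
    return max(0.0, 1.0 - dist / r)            # threshold function f clips negatives


def update_registration_map(m: List[List[Vec2]], winner: Tuple[int, int],
                            error: Vec2, rate: float, r: float) -> None:
    """Apply delta m = rate * error * N(x, y) in place around the winning site.

    `r` must be positive. `error` is the (pan, tilt) error measured after
    the saccade for this trial.
    """
    x_star, y_star = winner
    for y, row in enumerate(m):
        for x, (pan, tilt) in enumerate(row):
            n = neighborhood(x, y, x_star, y_star, r)
            if n > 0.0:
                row[x] = (pan + rate * error[0] * n,
                          tilt + rate * error[1] * n)
```

With `rate = 0.25` and `r = 2`, the winning site receives a quarter of the measured error, a site one step away receives an eighth, and sites two or more steps away are left unchanged.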

6 Continued Work

Currently,  we  are  extending  these  tests  to  include  the 
full  visual  field,  and  continuing  our  experiments  with 
dynamically  varying  neighborhood  size  and  learning 
rate  parameters.  Soon  we  will  explore  the  self  orga¬ 
nizing  properties  of  the  representation  of  visual  in¬ 
formation  by  implementing  a  SOFM  for  visual  mo¬ 
tion.  We  will  also  begin  efforts  to  register  the  vi¬ 
sual  motion  with  an  auditory  ITD  map,  as  well  as 
investigate  the  dynamics  of  development  when  self¬ 
organization  and  registration  mechanisms  are  run  si¬ 
multaneously.  We  expect  to  see  evidence  for  differ¬ 
ent  developmental  time  scales  as  the  robot  learns  the 
orientation  task. 

With  the  above  work  in  place,  we  will  extend  the 
system  to  include  the  neck  and  body  degrees  of  free¬ 
dom  so  that  the  robot  can  perform  full  body  orienta¬ 
tion  behavior.  This  will  complicate  the  current  task 
by  adding  additional  degrees  of  freedom  that  must 
complement  each  other.  We  will  continue  to  inves¬ 
tigate  the  issue  of  developmental  time  scales  since 
more  complicated  behaviors  will  have  to  develop  in¬ 
crementally  and  bootstrap  off  of  existing  behaviors. 
We  would  like  to  integrate  the  full  orientation  behav¬ 
ior  with  reaching  and  manipulation  tasks  currently 
under  parallel  development  by  other  members  of  the 
group  [9]. 

7  Conclusions 

This  paper  describes  an  implementation  of  orienta¬ 
tion  behavior  on  Cog  using  registered  topographic 
maps.  We  have  presented  biological  evidence  that 
this  is  an  effective  method  for  orienting  to  multi¬ 
modal  stimuli  in  animals.  We  have  also  presented 
a  series  of  mechanisms  and  methods  for  developing 
this  behavior  on  Cog  over  time.  This  biologically  in¬ 
spired  framework  gives  us  the  opportunity  to  explore 
several  interesting  issues.  It  allows  us  to  investi¬ 
gate  using  dynamic  spatio-temporal  representations 
of  sensory-motor  space  to  integrate  multi-modal  in¬ 
formation  and  produce  a  unified  behavior.  It  also  al¬ 
lows  us  to  investigate  the  dynamics  of  development 
using  mechanisms  of  experience  dependent  plastic¬ 
ity.  Ongoing  work  is  promising. 

Acknowledgements 

A  number  of  people  have  made  this  work  and  the 
Cog  Project  possible.  Among  those  who  have  con¬ 
tributed  to  the  ongoing  development  of  Cog  are  (in 
alphabetical order): Mike Binnard, Rod Brooks,
Robert Irie, Eleni Kapogannis, Matt Marjanovic,
Yoky Matsuoka, Brian Scassellati, Nick Shectman,
Rene Schaad, and Matt Williamson.

References 

[1]  H.  Bauer  and  K.  Pawelzik.  Quantifying  the 
neighborhood  preservation  of  self-organizing 
feature  maps.  IEEE  Transactions  on  Neural 
Networks,  3(4):570-579,  1992. 

[2]  M.  Brainard  and  E.  Knudsen.  Experience- 
dependent  plasticity  in  the  inferior  collicu¬ 
lus:  A  site  for  visual  calibration  of  the  neu¬ 
ral  representation  of  auditory  space  in  the  barn 
owl.  The  Journal  of  Neuroscience,  13(11):4589- 
4608,  1993. 

[3]  M.  Brainard  and  E.  Knudsen.  Dynamics  of  vi¬ 
sually  guided  auditory  plasticity  in  the  optic 
tectum  of  the  barn  owl.  Journal  of  Neurophys¬ 
iology,  73(2):595-614,  1995. 

[4]  R.  Brooks  and  L.  A.  Stein.  Building  brains  for 
bodies. Autonomous Robots, 1(1):7-25, 1994.

[5]  R.  Durbin  and  G.  Mitchison.  A  dimension  re¬ 
duction  framework  for  understanding  cortical 
maps.  Nature,  343(6259):644-647,  1990. 

[6]  T.  Kohonen.  Self-organized  formation  of  topo¬ 
logically  correct  feature  maps.  Biological  Cy¬ 
bernetics,  43:59-69,  1982. 

[7] K. Obermayer, H. Ritter, and K. Schulten. A
principle  for  the  formation  of  the  spatial  struc¬ 
ture  of  cortical  feature  maps.  Proceedings  of  the 
National  Academy  of  Science  USA,  87:8345- 
8349,  1990. 

[8] H. Ritter and K. Schulten. Kohonen’s self-organizing
maps: Exploring their computational
capabilities. In Proceedings of the IEEE
International  Conference  on  Neural  Networks, 
volume  1,  pages  109-116,  San  Diego,  CA,  1988. 

[9] B. Scassellati, M. Williamson, and M. Marjanovic.
Self-taught visually-guided pointing for
a humanoid robot. In Proceedings of the 4th
Intl. Conference on Simulation of Adaptive Behavior,
Cape Cod, MA, 1996.

[10]  B.  Stein  and  M.  Meredith.  The  Merging  of  the 
Senses.  A  Bradford  Book,  Cambridge,  MA, 
1993. 




A Bootstrapping Algorithm for Learning Linear Models of Object Classes


Thomas  Vetter 
Max-Planck-Institut
für biologische Kybernetik
72076 Tübingen, Germany
email: vetter@mpik-tueb.mpg.de


Michael  J.  Jones 
Center  for  Biological  and 
Computational  Learning 
MIT,  Cambridge,  MA  02139 
email:  mjones@ai.mit.edu 


Tomaso  Poggio 
Center  for  Biological  and 
Computational  Learning 
MIT,  Cambridge,  MA  02139 
email:  tp@ai.mit.edu 


Abstract 

Flexible models of object classes, based on linear
combinations of prototypical images, are capable
of matching novel images of the same class
and have been shown to be a powerful tool to
solve several fundamental vision tasks such as
recognition, synthesis and correspondence. The
key problem in creating a specific flexible model
is the computation of pixelwise correspondence
between the prototypes, a task done until now in
a semiautomatic way. In this paper we describe
an algorithm that automatically bootstraps the
correspondence between the prototypes. The
algorithm - which can be used for 2D images as
well as for 3D models - is shown to synthesize
successfully a flexible model of frontal face images
and a flexible model of handwritten digits.

1  Introduction 

In recent papers we have introduced a new type
of flexible model for images of objects of a certain
class. The idea is to represent images of
a certain type - for instance images of frontal
faces - as the linear combination of prototype
images and their affine deformations. This flexible
model can be used as a generative model
to synthesize novel images of the same class. It
can also be used to analyze novel images by estimating
the model parameters via an optimization
procedure. Once estimated, the model can
be used for indexing, for recognition, for image
compression and for image correspondence.

* This research is sponsored by grants from ARPA-ONR
under contract N00014-92-J-1879 and from ONR
under contract N00014-93-1-0385 and by a grant from
the National Science Foundation under contract ASC-9217041
(this award includes funds from ARPA provided
under the HPCC program) and by a MURI grant
N00014-95-1-0600. Additional support is provided by

At the very heart of our flexible models is an
image representation in terms of which a linear
combination of images makes sense. For
a set of images to behave as vectors, they
must be in pixelwise correspondence (see [Poggio
and Beymer, 1996]). Our model uses pixelwise
correspondence between example images
and should not be confused with techniques
which use linear combinations of images such as
the so-called eigenfaces technique ([Kirby and
Sirovich, 1990]; [Turk and Pentland, 1991]). In
our approach, the correspondences between a
reference image and the other example images
are obtained in a preprocessing phase. Once the
correspondences are computed, an image is represented
as a shape vector and a texture vector.
The shape vector specifies how the 2D shape of
the example differs from a reference image and
corresponds to the flow field between the two
images. Analogously, the texture vector specifies
how the texture differs from the reference
texture. Here we are using the term “texture”
to mean simply the pixel intensities (grey level
or color values) of the image. Our flexible model
for an object class is then a linear combination
of the example shape and texture vectors.
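As a minimal sketch of what "a linear combination of the example shape and texture vectors" means in practice, the snippet below forms a model instance from flattened shape and texture vectors. The separate coefficient sets for shape and texture, the flat list representation, and all names are assumptions made for illustration; rendering the result as an image would additionally require warping the texture by the shape, which is omitted here.

```python
from typing import List, Sequence, Tuple

Vector = List[float]  # a flattened shape (flow) or texture (intensity) vector


def linear_combination(coeffs: Sequence[float],
                       vectors: Sequence[Vector]) -> Vector:
    """Return sum_i coeffs[i] * vectors[i], computed element by element."""
    if len(coeffs) != len(vectors):
        raise ValueError("need exactly one coefficient per prototype vector")
    n = len(vectors[0])
    return [sum(c * v[k] for c, v in zip(coeffs, vectors)) for k in range(n)]


def model_instance(shape_coeffs: Sequence[float],
                   texture_coeffs: Sequence[float],
                   shapes: Sequence[Vector],
                   textures: Sequence[Vector]) -> Tuple[Vector, Vector]:
    """Combine prototype shapes and textures with separate coefficients (assumed)."""
    return (linear_combination(shape_coeffs, shapes),
            linear_combination(texture_coeffs, textures))
```

Because every prototype is expressed relative to the same reference image, component k of each vector refers to the same point on the object, which is what makes the element-wise sum meaningful.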

1.1  A  key  problem:  creating  the 
model  from  prototypes 

The  distinguishing  aspect  of  our  flexible  models 
is that they are linear combinations of prototype
shape and texture vectors and not of images
([Beymer and Poggio, 1996]). The prototypical
images must be vectorized first; that is,
correspondence must be computed among them.

This  is  a  key  step  and  in  general  a  difficult  one. 
It  needs  to  be  done  only  once  at  the  stage  of  de¬ 
veloping  the  model.  At  run-time  no  further  cor¬ 
respondence  is  needed  —  and  in  fact  the  model 
can  be  used  to  compute  correspondence  if  nec¬ 
essary.  In  our  past  papers  we  computed  cor¬ 
respondence  between  the  prototypes  with  au¬ 
tomatic  techniques  such  as  optical  flow.  Some¬ 
times,  however,  we  were  forced  to  use  interactive 
techniques  requiring  the  user  to  specify  at  least 
some  of  the  correspondences  (see  for  instance 
[Lines, 1996]). An automatic technique that
could set prototypes in correspondence would
therefore be desirable even if very slow. In addition,
any claim of biological plausibility would
require  the  demonstration  of  such  a  technique. 

In  this  paper  we  describe  a  bootstrapping  tech¬ 
nique  that  seems  capable  of  computing  corre¬ 
spondence  between  prototypical  images  in  cases 
in  which  standard  optical  flow  algorithms  fail. 

2  Linear  models 

For the formal specification of the model, we refer
the reader to [Jones and Poggio, 1996]. The
relevant notation used in that paper and referred
to later in this one is as follows. Images
$I_0$ through $I_N$ are the prototype images. The
flow field $S_j : \mathbb{R}^2 \to \mathbb{R}^2$ maps the points of
$I_0$ onto $I_j$. $S_j$ is also called the shape vector.
$T_j = I_j \circ S_j$ is the texture vector for image $I_j$.
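The texture vector is the prototype image sampled through the flow field, i.e. a backward warp onto the reference. A minimal sketch follows, under assumptions that are not in the paper: the flow is stored as a per-pixel displacement (dx, dy), sampling is nearest-neighbor, and out-of-range coordinates are clamped to the image border. The function name is hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]             # row-major grayscale image
Flow = List[List[Tuple[float, float]]]  # per-pixel displacement (dx, dy), assumed


def backward_warp(image: Image, flow: Flow) -> Image:
    """Return T with T(x, y) = image(x + dx, y + dy), in the reference frame."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            # Nearest-neighbor lookup, clamped to the image border.
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            row.append(image[sy][sx])
        out.append(row)
    return out
```

A practical implementation would usually interpolate (for example bilinearly) instead of rounding to the nearest pixel.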

2.1  Matching  the  model 

The  matching  algorithm  is  also  described  in  de¬ 
tail  in  [Jones  and  Poggio,  1996].  In  brief,  the 
matching  algorithm  attempts  to  minimize  the 
error  between  the  novel  input  image  and  the 
current  guess  for  the  best  fitting  model  image. 
This is done by a gradient descent algorithm
which adjusts the parameters of the model to
minimize the L2 error between the novel and
model images.
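To make the gradient-descent idea concrete, here is a deliberately simplified sketch. It is not the matching algorithm of [Jones and Poggio, 1996], which also optimizes shape parameters through image warping; this version assumes the images are already aligned and fits only the linear coefficients of the prototypes by minimizing the squared L2 error. The function name, the fixed step size `rate`, and the step count are assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def fit_coefficients(novel: Vector, prototypes: Sequence[Vector],
                     rate: float = 0.1, steps: int = 500) -> List[float]:
    """Gradient descent on E(c) = sum_k (novel[k] - sum_i c[i] * prototypes[i][k])**2."""
    coeffs = [0.0] * len(prototypes)
    for _ in range(steps):
        # Current model image and its residual against the novel image.
        model = [sum(c * p[k] for c, p in zip(coeffs, prototypes))
                 for k in range(len(novel))]
        residual = [n - m for n, m in zip(novel, model)]
        # dE/dc_i = -2 * <residual, prototype_i>
        grads = [-2.0 * sum(r * p[k] for k, r in enumerate(residual))
                 for p in prototypes]
        coeffs = [c - rate * g for c, g in zip(coeffs, grads)]
    return coeffs
```

The fixed step size must be small relative to the scale of the prototype vectors, otherwise the iteration diverges; real image data would need normalization or a line search.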

2.2  Optical  Flow 

For some prototypes, the pixelwise correspondences
from the reference image to the prototype
can be found accurately by an optical flow
algorithm. We have mostly used the multiresolution,
Laplacian-based optical flow algorithm
described in [Bergen and Hingorani, 1990].

Figure 1: Given the flexible model provided by
the combination of image 1 and image 2 (in correspondence),
the goal is to find the correspondence
between image 1 (or image 2) and the
novel image 3. Our solution is to first find the
linear combination of image 1 and image 2 that
is closest to image 3 (this is image 1’) and then
find the correspondences from image 1’ to image
3 using optical flow. The two flow fields can
then be composed to yield the desired flow from
image 1 to image 3.

3  Bootstrapping  the  synthesis  of  a 
flexible  model 

Suppose that we have a flexible model consisting
of N prototypes in correspondence. It is tempting
to try to use it to compute the correspondence
to a novel image of an object of the same
class  so  that  it  can  be  added  to  the  set  of  proto¬ 
types.  The  obvious  flaw  in  this  strategy  is  that 
if  the  flexible  model  can  compute  good  corre¬ 
spondence  to  the  new  image  then  there  is  no 
need  to  add  it  to  the  flexible  model  since  it  will 
not  increase  its  expressive  power.  If  it  can’t, 
then  the  new  prototype  cannot  be  incorporated 
as  such.  A  possible  way  out  of  this  conundrum 
is  to  bootstrap  the  flexible  model  by  using  it 
together  with  an  optical  flow  algorithm. 

Figure 2: This figure shows the basic idea behind bootstrapping. Image (a) is the reference face.
Image  (b)  is  a  prototype.  Image  (c)  is  the  image  resulting  from  backward  warping  the  prototype 
onto  the  reference  face  using  the  correspondences  found  by  an  optical  flow  algorithm.  Image  (d) 
is  the  model  image  which  best  matches  the  prototype  using  a  model  consisting  of  20  prototypical 
faces  (which  did  not  include  image  (b)).  Image  (e)  is  the  image  resulting  from  backward  warping 
the  prototype  onto  the  reference  face  using  the  flow  field  which  was  composed  from  matching  the 
face  model  and  then  running  an  optical  flow  algorithm  between  image  (d)  and  image  (b)  to  further 
improve  the  correspondences.  This  is  the  basic  step  of  the  bootstrapping  algorithm. 


3.1  The  basic  recursive  step:  improv¬ 
ing  the  match  with  optical  flow 

Suppose that an existing flexible model is not
powerful enough to match a new image and
thereby find correspondence with it. The idea is
first to find rough correspondences to the novel
image using the (inadequate) flexible model and
then  to  improve  these  correspondences  by  using 
an  optical  flow  algorithm.  This  idea  is  illus¬ 
trated  in  figure  1.  In  the  figure,  a  model  consist¬ 
ing  of  image  1  and  image  2  (and  the  correspon¬ 
dences  between  them)  is  first  fit  to  image  3.  Call 
the  best  fitting  linear  combination  of  images  1 
and  2  image  1’.  The  correspondences  are  then 
improved  by  running  an  optical  flow  algorithm 
between  the  intermediate  image  1’  and  image  3. 
Notice  that  this  technique  can  be  regarded  as  a 
class  specific  regularization  of  optical  flow. 
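The basic step ends by chaining two flow fields: the one implied by the matched model (image 1 to image 1’) and the one found by optical flow (image 1’ to image 3). A minimal sketch of that composition is given below. It assumes both fields are stored as per-pixel displacements (dx, dy) on the same grid, looks up the second field at the nearest pixel, and clamps at the border; the function name is hypothetical.

```python
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]  # per-pixel displacement (dx, dy), assumed


def compose_flows(first: Flow, second: Flow) -> Flow:
    """Displacement of applying `first`, then `second` at the displaced point.

    total(p) = first(p) + second(p + first(p))
    """
    h, w = len(first), len(first[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx1, dy1 = first[y][x]
            # Where does `first` send this pixel? (nearest pixel, clamped)
            mx = min(max(int(round(x + dx1)), 0), w - 1)
            my = min(max(int(round(y + dy1)), 0), h - 1)
            dx2, dy2 = second[my][mx]
            row.append((dx1 + dx2, dy1 + dy2))
        out.append(row)
    return out
```

When the first field is already a good approximation, the second field only has to account for small residual displacements, which is the regime where optical flow is most reliable.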

3.1.1  Example 

An example of our basic step is shown in figure 2.
In this figure, an optical flow algorithm
is used to find the correspondences from image
(a) to image (b). The resulting correspondences
are not very good, as shown by image (c), which
is  the  backward  warp  of  image  (b)  according  to 
the  correspondences  found  by  optical  flow.  Im¬ 
age  (c)  should  have  the  texture  of  image  (b) 
and  the  shape  of  image  (a).  A  better  way  to 
find  the  correspondences  is  to  first  fit  a  model 
of  faces  to  image  (b),  by  using  as  a  model  20 
prototype  face  images  (with  known  correspon¬ 
dences).  The  model  is  matched  to  image  (b) 
as  described  in  section  2.1.  The  resulting  best 
match  is  shown  as  image  (d).  Next,  optical  flow 
was  run  between  image  (d)  and  image  (b)  to  fur¬ 
ther  improve  the  correspondences  found  by  the 
matching  algorithm.  The  two  correspondence 
fields  are  combined  to  get  the  correspondences 
from  image  (a)  to  image  (b).  Image  (e)  is  the 
backward  warp  of  image  (b)  according  to  the  fi¬ 
nal  correspondence.  Comparing  image  (c)  with 
(e)  shows  better  correspondences  are  found  with 
bootstrapping. 

3.2  The  bootstrapping  algorithm 

The idea of bootstrapping is to start from a
small flexible model consisting of just two prototypical
images and to increase its size (and
representation power) by iterating the recursive
step described above, progressively adding new
images by setting them in correspondence with
the model.

There  are  two  main  problems  with  building  a 
linear  flexible  model.  The  first  one  is  to  choose 
the  reference  image,  relative  to  which  shape  and 
texture  vectors  are  represented.  The  second  is 
to  automatically  compute  the  correspondences 
even  in  cases  in  which  optical  flow  fails. 

In principle, any example image could be used
as the reference image. However, small peculiarities
in an image can influence the matching
process. Thus, an image which is close to all images
is more reliable, since the computation of
the correspondence is more stable for small distortions
than for bigger ones. The average image
of the whole data set, for which the average
distance to the whole data set is by definition at
a minimum, is the optimal reference image. Since
the correspondences between the images cannot
be computed correctly in one step, the average
has to be computed in an iterative procedure.
Starting from an arbitrary image as the
preliminary reference, a (noisy) correspondence
between all other images and this reference is
first computed using an optical flow algorithm.
On the basis of these correspondences an average
image can be computed, which now serves
as a new reference image. This procedure of
computing the correspondences and calculating
a new average image is repeated until a stable
average (vectorized) image is obtained.

The correspondence fields obtained through the
optical flow algorithm from this final average
image to all the examples are usually far from
perfect. The bootstrapping idea is to improve
the correspondences by applying iteratively the
basic step described above while also increasing
the expressive power of the flexible model.
We could incorporate into the flexible model one
new image at each timestep. Instead, we have
implemented an equivalent algorithm in which
the first step is to form a linear object model
from the correspondences obtained from all images
with optical flow. Since some of these correspondence
fields are not correct and all are
noisy, this algorithm uses only the most significant
fields as provided by a standard PCA decomposition
of the shape and the texture vectors.
Instead of adding new images, the algorithm
increases with successive iterations the
number of principal components, ordered according
to the associated eigenvalues (the allowed
range of parameters of the selected principal
components can also be increased with
a similar effect). At each iteration a flexible
model is selected and used to match each image.
The optical flow algorithm estimates correspondence
between the image and the approximation
provided by the flexible model. This field is
then added to the correspondence field implied
by the matched model, giving a new correspondence
field between the reference image and the
example. The correspondence fields, obtained
by this procedure, will finally lead to a new
average image and also to new principal components
which can be incorporated in an improved
flexible model. Iterating this procedure
with increasing expressive power of the model
(by increasing the number of principal components)
leads to stable correspondence fields between
the reference image and the examples.
The number of iterations as well as the increasing
complexity of the model can be regarded as
regularization parameters of this bootstrapping
process.

3.2.1 Pseudo code of an efficient algorithm

1A: Selecting a reference image.

Select some image I_i as reference image I_ref.
Until convergence do {
    For all I_i {
        Compute correspondence field S_i
        between I_ref and I_i using optical flow.
        Backwards warp I_i onto I_ref using S_i
        to get the texture map T_i.
    end For }
    Compute average over all S_i and T_i.
    Forward warp T_avg using S_avg to
    create I_avg.
    Convergence test: is ||I_avg - I_ref|| < limit?
    Copy I_avg to I_ref.
end Until }

1B: Computing the correspondence.

Until number n of principal components used
in the linear model is maximal {
    Perform a principal component analysis on
    {S_i} and separately on {T_i}.
    Select the first n principal components for
    the linear model.
    Approximate each I_i by the linear model
    with I_i^model.
    Compute correspondence field S'_i between
    I_i^model and I_i using optical flow.
    Combine S'_i and S_i^model to S_i^new.
    Backwards warp I_i onto I_ref using S_i^new
    to get the texture map T_i.
    Copy all S_i^new to S_i.
    Increase number n of principal components
    used in the linear model.
end Until }
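The PCA step of part 1B (select the first n principal components and approximate each example with them) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `pca_basis` and `approximate`, the random stand-in data, and the use of an SVD to obtain the components are all assumptions made for illustration. Optical flow and warping are left out.

```python
import numpy as np

def pca_basis(vectors):
    """Return the mean and the principal directions (rows), largest variance first."""
    mean = vectors.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions,
    # and NumPy returns them ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt

def approximate(vectors, mean, components, n):
    """Project each vector onto the first n components and reconstruct it."""
    basis = components[:n]
    coeffs = (vectors - mean) @ basis.T
    return mean + coeffs @ basis

# Hypothetical stand-in for the shape vectors S_i: 20 examples, 50 dimensions.
rng = np.random.default_rng(0)
shapes = rng.normal(size=(20, 50))
mean, components = pca_basis(shapes)

# Reconstruction error for a growing number of components.
errors = [np.linalg.norm(shapes - approximate(shapes, mean, components, n))
          for n in (2, 10, 19)]
```

With 20 centered examples the data have rank at most 19, so the last model reproduces the examples up to rounding error. The smaller models give coarser approximations, which is the sense in which the number of components acts as a regularization parameter.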

4  Results 

The method described in the previous sections was tested on two
different classes of images. One class was frontal views of human
faces and the second was handwritten digits. We describe here only
the first class (for the second see [Vetter et al., 1997]).

4.1  Face  images 

4.1.1  Data  set 

Frontal images of 130 Caucasian faces were used in our experiments.
The images were originally rendered for psychophysical experiments
[Troje and Bülthoff, 1995] under ambient illumination conditions
from a data base of three-dimensional human head models recorded
with a laser scanner (Cyberware TM). All faces were without makeup,
accessories, and facial hair. Additionally, the head hair was
removed digitally (but with manual editing), via a vertical cut
behind the ears. The resolution of the grey-level images was
256-by-256 pixels and 8 bit.

Preprocessing: First the faces were segmented from the background
and aligned roughly by automatically adjusting them to their
two-dimensional centroid. The centroid was computed by averaging
separately the x and y coordinates of all image pixels belonging to
the face, independent of their intensity value.
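The centroid alignment described above can be sketched as follows. This is only an illustration under stated assumptions: the face is taken to be available as a boolean mask, the helper names `face_centroid` and `center_on_centroid` are hypothetical, and the shift is done in whole pixels with `np.roll`, which wraps around at the border (harmless only when the background is empty).

```python
import numpy as np

def face_centroid(mask):
    """Centroid (row, col) of a boolean face mask; pixel intensities are ignored."""
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()

def center_on_centroid(image, mask):
    """Shift the image by whole pixels so the face centroid lands on the image center."""
    cy, cx = face_centroid(mask)
    dy = int(round((image.shape[0] - 1) / 2 - cy))
    dx = int(round((image.shape[1] - 1) / 2 - cx))
    # np.roll wraps around the border: an assumption that the background is empty.
    return np.roll(image, (dy, dx), axis=(0, 1))

# Toy example: a 2x2 "face" placed off-center in an 8x8 image.
img = np.zeros((8, 8))
img[2:4, 4:6] = 1.0
centered = center_on_centroid(img, img > 0)
```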


4.1.2  Evaluation 

The method described in the previous sections was successfully
applied to all face images available.

The step involving synthesis of the reference (average) image was
tested for each image as a starting image in the algorithm. As a
convergence criterion we used a threshold on the minimum average
change of the pixel gray value (0.3, whereas the range was 256). The
threshold was reached in every case within 5 iterations and mostly
after 3. The final reference images could not be distinguished under
visual inspection. One of these reference images is shown in the
second column of figure 3; the same reference image was used for the
final correspondence finding procedure.
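One plausible reading of this convergence test is a threshold on the mean absolute gray-value change between successive reference images. The sketch below assumes that reading. The function name `has_converged` is hypothetical, and the default limit of 0.3 is the value quoted in the text.

```python
import numpy as np

def has_converged(new_average, old_reference, limit=0.3):
    """True when the mean absolute gray-value change per pixel falls below the limit."""
    change = np.abs(new_average.astype(float) - old_reference.astype(float))
    return bool(change.mean() < limit)

# Toy check on constant 8-bit style images.
reference = np.full((4, 4), 100.0)
small_step = has_converged(reference + 0.2, reference)
large_step = has_converged(reference + 1.0, reference)
```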

Optical flow yields the correct correspondence between the reference
image and each example image only in 80% of all cases. In the
remaining cases the correspondence is partly incorrect, as shown in
figure 3. The center column shows the images which result from
backward warping the face images (left column) onto the reference
image using the correspondence fields obtained through the optical
flow algorithm. In the first iteration of the correspondence finding
procedure the first 2 principal components of the shape vectors
(that is, of the correspondence fields) and of the texture vectors
are used in the flexible model. Then the correspondence field
provided by matching with the flexible model is combined with the
correspondence field obtained by the optical flow algorithm between
the face image and its flexible model approximation. The backward
warps using these correspondence fields are shown in the fourth
column. The correspondence fields were iterated by slowly increasing
the number of principal components used in the flexible model. After
four iterations with 2, 10, 30 and 80 principal components, the
correspondence fields between the reference face and all example
images did not reveal any obvious errors (right column).

5  Conclusions 

The bootstrapping algorithm we described is not a full answer to the
problem of computing correspondence between prototypes. It provides,
however, an initial and promising solution to the very difficult
problem of automatic synthesis of the flexible models from a set of
prototypical examples. Notice that we have used multiresolution
optical flow as one part of our bootstrapping algorithm. In
principle other matching techniques could be used within our
bootstrapping scheme.

Figure 3: Two of the most difficult faces in our data set. The
correspondence between face images (left column) and a reference
face can be visualized by backward warping of the face images onto
the reference image (three columns on the right). The correspondence
obtained through the optical flow algorithm does not allow a correct
mapping (center column). The first iteration with a linear flexible
model consisting of two principal components already yields a
significant improvement (top row). After four iterations with 10, 30
and 80 components, respectively, all correspondences were correct
(right column).

Abstract

MIT's Modular Vision System architecture consists of three parts:
Pentium processor, software modules, and hardware modules. The
hardware modules are further classified into application-specific
and programmable accelerators. This paper describes a programmable
pixel-parallel accelerator. Features include its compact physical
size (desktop), efficient control path architecture, convenient
programming environment with the C++ programming language, and
real-time capability (e.g., 9.9 msec for optical flow). Applications
tested so far include template matching, optical flow, and stereo
matching.

1  Introduction 

Typical low-level image processing tasks, including optical flow,
template matching, stereo matching, and smoothing and segmentation
computations, require thousands of operations per pixel for each
input image. Traditional general-purpose computers, such as PCs, are
not capable of performing such tasks in real time.

The mismatch between the demands of low-level image processing tasks
and the characteristics of conventional computers motivates
investigation of alternative architectures for a programmable
hardware module as part of MIT's Modular Vision System shown in
Figure 1. MIT's Modular Vision System is being developed as a common
computing platform with high-performance, low-cost, modular
components. It consists of commercial microprocessors with
complementary hardware modules. The structure of image processing
tasks suggests employing a single instruction stream, multiple data
stream (SIMD) design: an array of processing elements, one per
pixel, sharing instructions issued by a single controller.

This work was partly supported by contract N00014-95-1-0600 from the
Office of Naval Research.


Figure 1: MIT's Modular Vision System Architecture

Massively parallel SIMD supercomputers, providing large processing
element arrays, are able to perform image processing tasks in real
time. But their million-dollar price tags preclude widespread use.
These systems, intended for a broader range of applications, offer
capabilities far beyond those necessary for processor-per-pixel
image processing. For example, the Connection Machine Models CM-1
and CM-2 [Tucker and Robertson, 1988] provide over four thousand
bits of memory per processing element. Sophisticated processing
element communication networks employed in the CM-1 and CM-2 account
for a large portion of the cost of the machines, but for most
low-level image processing tasks, the networks offer little
advantage over much simpler designs.

Figure 2: Image processing system using fully integrated processing
elements.

We present a system design capable of supporting real-time image
processing in a desktop computer environment. Figure 2 illustrates
the principal components of the system design. Modern VLSI
technology permits the economical implementation of large
processor-per-pixel arrays. A two-dimensional network connects the
processing elements. As shown in the figure, control and data paths
are distinct. The processing element array receives instructions
from the controller, which is managed by the host computer. In the
image data path, the data format is converted to map image data onto
the array processor and converted back after processing.

2  System  Architecture 

System architecture is largely determined by two factors: the scheme
by which the host computer controls the array-processor chips, and
the architecture of the array-processor chips themselves. For the
control scheme, one might consider using software executed on a host
computer to generate low-level instructions in order to minimize
system cost. In this approach, low-level instructions are
transferred over the host's backplane bus. Unfortunately, there are
two serious problems with direct sequential control. First, the host
computer may not be able to compute instructions rapidly enough to
keep the array busy. Second, the rate at which the host can deliver
instructions to the array is limited by bus transaction delays.

To reduce the bus bandwidth and host computation required to sustain
array activity, modern SIMD systems typically employ a multi-level
control hierarchy. In a conventional two-level design, a
microcontroller is placed between the host and the processing
element array. The microcontroller executes microcode, interpreting
macroinstructions issued by a host and transferred over the host's
backplane bus. The microcontroller produces the instructions
executed by the array. Typical macroinstructions produce many array
instructions. Thus, the demands on the host computer and bus are
reduced. The Connection Machine [Hillis, 1985] and the Massively
Parallel Processor [Batcher, 1985] employ two-level designs. The
associative string processor (ASP) testbed developed by Aspex
Microsystems [Habiger and Lea, 1993] and the Vaster processor
[Loucks, 1982] employ three-level control hierarchies, adding an
additional level of interpretation between the host computer and the
microcontroller.

While the conventional hierarchical control strategy may be suitable
for SIMD supercomputers, it is not appropriate for less expensive
systems. One drawback is that to generate array instructions at a
rate commensurate with the speed of the processing element array, a
sophisticated, fast microcontroller is needed. The microcontroller
design must include one or more functional units to perform the
arithmetic and logical operations involved in producing array
instructions. Given the amount of computation generally required to
decode macroinstructions and produce array instructions, the
frequency of the controller clock must be several times that of the
array clock.

A second drawback to conventional hierarchical control is the burden
of providing appropriate software support. With both a control path
and one or more functional units, the microcontroller amounts to a
special-purpose computer. Separate programs are required for the
host and controller. Getting these two programs to work together can
be more than twice as hard as writing a single program. The
microcontroller also requires its own set of development tools,
including a compiler and a debugger.

Run-time instruction computation is not needed for real-time image
processing. The same low-level tasks are repeated for each image. An
efficient control strategy exploits this property. Sequences of
array instructions are generated by the host computer before
processing begins and are stored in the controller. Macroinstructions
are reduced to simple calls telling the controller which sequence to
send to the processing element array. The controller can be
simplified because it does not have to decode and interpret complex
macroinstructions.

With our controller architecture, sequences of microinstructions,
generated by the host, are held in the control store.
Microinstructions include both array instructions and sequencer
inputs. To initiate a sequence of microinstructions, the host
computer writes the starting address of the sequence into the opcode
register. The sequencer steps through the control store, producing
one array instruction every clock cycle.
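The control flow just described can be modeled in a few lines. This is a behavioral sketch under stated assumptions, not the actual controller: the class name `Controller`, the "NEXT"/"STOP" sequencer inputs, and the instruction mnemonics are all hypothetical, and one loop iteration stands for one clock cycle.

```python
class Controller:
    """Toy model: a control store of precomputed microinstructions plus a sequencer."""

    def __init__(self):
        self.control_store = []      # filled by the host before processing begins
        self.opcode_register = None  # starting address written by the host

    def load_sequence(self, array_instructions):
        """Host-side step: store a precomputed sequence and return its start address."""
        if not array_instructions:
            raise ValueError("a sequence needs at least one instruction")
        start = len(self.control_store)
        last = len(array_instructions) - 1
        for i, instruction in enumerate(array_instructions):
            # Each microinstruction pairs an array instruction with a sequencer input.
            self.control_store.append((instruction, "STOP" if i == last else "NEXT"))
        return start

    def run(self, start_address):
        """Host writes the start address; the sequencer issues one instruction per cycle."""
        self.opcode_register = start_address
        address = start_address
        issued = []
        while True:
            instruction, sequencer_input = self.control_store[address]
            issued.append(instruction)
            if sequencer_input == "STOP":
                return issued
            address += 1

controller = Controller()
first = controller.load_sequence(["LOAD", "ADD"])
second = controller.load_sequence(["SHIFT", "AND", "STORE"])
```

Calling `controller.run(second)` returns the three instructions of the second sequence in order. The host's only run-time work is the single write of the start address, which is the point of the design.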

Our control strategy has several important features:

(1) Simple controller hardware

Since array instructions are generated by the host computer, the
controller need not perform arithmetic and logical operations. Thus,
the controller does not include a data path. Instead, array
instructions are merged into the control path. Furthermore, the
controller need not operate with a higher clock frequency than the
processing element array.

(2) High array utilization

During processing, the host computer need not issue individual
instructions to the processing element array. Thus, provided that
sequences are sufficiently long, array utilization is not
unacceptably limited by host speed or by bus transaction delays.

(3) Unified software development

All system software can be created and compiled on the host computer
with existing software development tools.

3  Chip  Design 

A processing element combines a 128-bit DRAM column with
one-bit-wide logic. Three latches provide inputs to a function
generator. Eight control signals select between the 256 three-input
Boolean functions. Additional latches hold write data and provide a
local write-enable signal.
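Eight control signals suffice because a three-input Boolean function has 2^3 = 8 truth-table rows, and each row's output is one bit, giving 2^8 = 256 functions. The sketch below treats the control word as that truth table. The bit ordering (input `a` as the most significant index bit) and the names are assumptions for illustration, not the chip's documented encoding.

```python
def function_generator(control, a, b, c):
    """Evaluate one of the 256 three-input Boolean functions.

    The eight control bits are read as a truth table: bit (4a + 2b + c)
    of `control` is the output for the inputs a, b, c.
    """
    return (control >> ((a << 2) | (b << 1) | c)) & 1

# Two control words that together form a one-bit full adder.
XOR3 = 0b10010110  # sum bit: parity of the three inputs
MAJ = 0b11101000   # carry bit: majority of the three inputs

def full_adder(a, b, carry_in):
    """One bit-serial addition step built from two function-generator settings."""
    return function_generator(XOR3, a, b, carry_in), function_generator(MAJ, a, b, carry_in)
```

Repeating `full_adder` once per bit position is how a one-bit-wide processing element can add multi-bit numbers, at the cost of many array instructions per arithmetic operation.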

Processing elements are interconnected to create a two-dimensional
rectangular array, matching the structure of image data. Input
signals from adjacent processing elements are combined with the
function generator result according to control signals. With the
interconnection network extended across chip boundaries, multiple
chips may be used to form processing element arrays matching the
size of large images.

The chip was fabricated in a 3.3 ASIC process available through the
MOSIS Service. The chip has eight blocks of 512 processing elements.
At the center of each block is a twin cell DRAM array with 128 rows
and 512 columns. The cell structure is similar to early planar cells
with metal bitlines and polysilicon wordlines and platelines
[Rideout, 1979], but incorporates a second-level metal wordline
shunt. Sense amplifiers are placed at the bottom of the DRAM array.

The layout pitch of processing element logic is double the memory
column pitch. Arrays of 256 logic units are placed above and below
the DRAM array. There are no column decoders; logic circuits are
connected directly to the bitlines. The pitch-matched processor
implementation maximizes the bandwidth between memory and logic and
minimizes processing element area [Elliot et al., 1979]. Interface
circuits use a 3.3 V supply to make input and output levels
compatible with standard 3.3 V parts. Wordline drivers and
platelines also use a 3.3 V supply. All other circuits use a 2.5 V
supply. The chip characteristics are listed in Table 1.

Table 1: Chip Characteristics

Channel length      0.6 µm (drawn)
Polysilicon pitch   1.5 µm (w/o contacts)
Metal1 pitch        2.1 µm (contacted)
Metal2 pitch        2.4 µm (contacted)
Metal3 pitch        3.0 µm (contacted)
Proc. elements      4096 (64 x 64)
Memory              512 Kb (twin cell DRAM)
Devices             2.7 M
Pads                144
Die size            9.7 x 8.1 mm^2
Memory cell size    7.2 x 7.2 µm^2
Power supplies      Vdd = 2.5 V (internal)
                    Vhh = 3.3 V (interface)
                    Vpp = 3.3 V (wordline)
Cycle time          60 ns
Pow. dissipation    300 mW (typical)

4  Programming  Environment 

Very fine-grained parallel processors with one-bit-wide processing
element logic present challenging software problems. Instruction
sets are very primitive: simple parallel arithmetic, comparison, and
data movement operations require sequences of many array
instructions. In addition, different processing element array
implementations employ different memory structures, provide
different instructions, and require different computational
algorithms. To facilitate application development and maintenance,
we created a programming framework supporting our system structure.
The framework hides details of array and controller implementations,
allowing programmers to focus on application issues.

We developed the programming framework using the C++ programming
language [Stroustrup, 1991]. C++ provides facilities for defining
new types (classes) that act in the same way as built-in types. In
effect, these facilities allow the language to be augmented to
support additional concepts. We used this capability, implementing
the framework as a library of C++ classes.

A processing element array coupled with a controller is represented
by a system class. System classes are implemented using an interface
to array and controller hardware or using array and controller
simulation code. The simulation option provides a valuable tool for
architecture development work. An object of class sequence
represents a set of controller and array instructions which may be
loaded into a region of the control store on the controller and
invoked within an application. Application programs employ
processing element arrays by directing the execution of sequences by
systems.

5  Experimental  Results 

Edge detection, template matching, optical flow, median filtering,
stereo matching, and smoothing and segmentation have been
implemented on the system, producing real-time results. The
underlying parallelism in these applications provides excellent
examples of the processor's capabilities. This section details one
of the applications and then lists the processing times of the
applications tested.

5.1  Template  Matching 

Template matching consists of comparing a template image to another
image to determine whether a portion of the image is the same as the
template. The algorithm used determines the match by computing the
difference between the template and the image being matched. The
template is normally smaller than the image that is being searched.
Every location is tested by calculating the sum of the differences
at that location. Figure 3 shows the template and Figure 4 shows the
image which was searched. The white square marks where the template
has matched the image. The processing time for this template
matching operation is 6.3 ms. The template size, which determines
the number of instructions, is 21x31 pixels. The size of the
searched image does not affect the speed because the operations are
performed in parallel. Every pixel of the image is tested with a
pixel in the template image at the same time.
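The matching scheme can be sketched with NumPy. The text speaks only of a "sum of the differences"; the sketch assumes a sum of absolute differences (SAD), and the name `sad_match` is hypothetical. The outer loops run once per template pixel and each pass updates every candidate location at once, which mirrors why the template size, not the image size, sets the instruction count on the array.

```python
import numpy as np

def sad_match(image, template):
    """Return the (row, col) of the best match and the full SAD score map."""
    ih, iw = image.shape
    th, tw = template.shape
    scores = np.zeros((ih - th + 1, iw - tw + 1), dtype=np.int64)
    # One pass per template pixel; all candidate locations are updated together.
    for r in range(th):
        for c in range(tw):
            window = image[r:r + scores.shape[0], c:c + scores.shape[1]]
            scores += np.abs(window.astype(np.int64) - int(template[r, c]))
    best = np.unravel_index(np.argmin(scores), scores.shape)
    return best, scores

# Toy check: cut a 6x6 template out of a random 8-bit image and find it again.
rng = np.random.default_rng(1)
scene = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
patch = scene[5:11, 7:13].copy()
location, score_map = sad_match(scene, patch)
```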
Figure 3: 21x31 Template Image & Test Image

6 Future Work

Two efforts are underway. First, the array size is being expanded
from 128x128 to 256x256 without increasing the board size. To avoid
increasing the board size, the array processor chips are being
repackaged in smaller packages and denser boards are being designed.
Second, the host computer is being switched from a workstation to a
PC for better cost-performance.


Table 2: Execution times of applications implemented on system.

App.                        Spec.                  Exec.
Smoothing & Segmentation    200 convolution,       4.8 ms
                            threshold 20
Edge Detection              vertical &             277 µs
                            horizontal edges
Template Matching           20x20 template         3.25 ms
5x5 Median Filtering                               274 µs
Optical flow                5x5 sup. region,       9.9 ms
                            7 pixel max disp.
Stereo Matching                                    4.0 ms

5.2 Processing Times

This section contains some processing time results for the
applications implemented on the system. The software module can also
estimate the processing time of the hardware module. When running
the software module, a count of array instructions can be obtained
which can be used to calculate a conservative approximation of the
processing time in the hardware module.

Execution Time = # of instructions x 60 ns +
                 # of sequence executions x 2.5 µs

Here 60 ns is the clock period of the controller and 2.5 µs is the
time needed for the controller and desktop to interface. The
execution times are generally less than (1/0.9) x # of instructions
x 60 ns. Actual execution time depends on array utilization. Table 2
shows the execution times of various applications implemented on
8-bit images.
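The estimate is easy to evaluate in code. In the sketch below the two constants come from the text, while the function names and the example counts (100,000 array instructions issued through 40 sequence calls) are hypothetical values chosen only to show the arithmetic.

```python
CYCLE_NS = 60             # controller clock period, from the text
CALL_OVERHEAD_NS = 2500   # controller/host interface time per sequence execution

def execution_time_ns(n_instructions, n_sequence_executions):
    """Conservative execution-time estimate in nanoseconds."""
    return n_instructions * CYCLE_NS + n_sequence_executions * CALL_OVERHEAD_NS

def array_utilization(n_instructions, n_sequence_executions):
    """Fraction of the estimated time the array spends executing instructions."""
    busy = n_instructions * CYCLE_NS
    return busy / execution_time_ns(n_instructions, n_sequence_executions)

# Hypothetical workload: 100,000 instructions issued in 40 sequence calls.
estimate_ns = execution_time_ns(100_000, 40)
utilization = array_utilization(100_000, 40)
```

For these counts the estimate is 6.1 ms, of which 6.0 ms is instruction time, so utilization is about 0.98. Long sequences keep the per-call overhead small, consistent with the bound of (1/0.9) times the pure instruction time.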
Abstract

Image databases are finding increased use in multimedia tasks such
as pattern recognition and image library searches. A content-based
approach to image database searching is useful in a variety of tasks
such as object recognition and image understanding. Typically, this
involves time-consuming searches through the database to find
relevant images. The I/O bandwidth required to quickly hash through
a large database can be on the order of gigabits per second. We have
designed VLSI components for searching databases with
high-dimensional data, such as image templates, to find the k
closest examples to an input query as determined by a programmable
metric using a massively parallel search. This nearest-neighbor
approach can be used directly for classification, or in conjunction
with any number of algorithms that exploit 'local' information. We
have integrated the search hardware on a single die with 16 Mbits of
DRAM to perform a highly parallel search with an on-chip memory
bandwidth of 6.4 Gbits/second. Multiple chips can be connected to
form a system capable of over 100 Gbits/second.

1  Introduction 

Image databases are exploited in multimedia applications such as
pattern recognition and image understanding and classification. For
example, consider OCR (optical character recognition). Several
examples of handwritten characters can be stored as 16 x 16
templates, or 256-dimensional data vectors. Given a query character
template, the retrieval of the best matches to the query from the
database can produce a classification match. Other examples include
scanning a database to find various objects and features, such as
faces or cars.

The nearest-neighbor algorithm can be used to retrieve close matches
to an input query, as defined by a given distance metric. Operations
can be performed on the retrieved matches to make a classification
decision, or the matches can be used in conjunction with more
complex algorithms as a tool for image understanding. Though there
are software implementations of nearest-neighbor searches that use
tree structures to index quickly into a multidimensional database in
time logarithmic in the number of stored data vectors, they have an
exponential dependence on the dimensionality of the input data. For
high-dimensional data they essentially perform an exhaustive search
of the database [1] [2]. Thus, a highly parallel exhaustive search
is the most practical for high-dimensional data.

Though the nearest-neighbor algorithm requires massive computation,
memory, and I/O, a hardware implementation can make it practical for
many applications. We have pursued a unified approach, combining
memory, distance computation, and sorting on a single chip to
perform a highly parallel nearest-neighbor search in hardware. The
design is flexible, allowing for a wide range of data sizes, and
full scalability of memory size, processing speed, and the number of
close matches that can be retrieved.

A short review of previous hardware implementations will be given,
followed by a description of the chip architecture and its
prospective performance.

2  Previous  Hardware  Approaches 

An examination of some previous hardware approaches for the
nearest-neighbor algorithm reveals a few drawbacks [3].

Some chips, such as the Intel Ni1000 [4] and the SGS-Thomson CAM
[5], offer storage of a few thousand data vectors on chip, with a
fast nearest-neighbor classification result, but are much too
restrictive in terms of small fixed data vector sizes (256
dimensions and 64 dimensions respectively) and poor bit resolution,
which limits them to a few special-purpose applications. Others with
better bit resolution, such as the IBM ZISC036 chip, offer only a
limited amount of on-chip data storage (36 data vectors for the
ZISC036), which is too small to store a reasonably sized database.

There are some chips [6] [7] which contain SIMD parallel processors
with embedded DRAM, capable of performing nearest-neighbor
computations, but they are implemented with only a few hundred bits
of DRAM per processor and offer limited communication between
processors, which makes them a poor fit for communication-heavy
algorithms like the nearest-neighbor computation.

The high data bandwidth and communication requirements of the
nearest-neighbor application should be considered in the design of
hardware, as well as ample storage for large databases and
high-dimensional data.

3  Architecture  for  k-Nearest 
Neighbors  Retrieval 

An exhaustive database search for k-nearest neighbors consists of two basic operations: distance calculation and sorting. The distance between an input query and each example in the database must be calculated, and the results sorted in a priority queue of size k so that only the k smallest distances are kept. The distance calculations are easily parallelized by assigning a distance calculator to each example in the database. The distances computed by each processor can be sorted in parallel while new distances are being calculated. This basic parallel, pipelined architecture is illustrated in Figure 1, which shows the three basic components of the architecture: memory, distance processing, and sorting. The level of parallelism is restricted only by the practical concerns of chip area and of flexibility for example databases of varying size and dimensionality.
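The two operations described above (distance calculation, then a size-k priority queue that keeps only the smallest distances) can be sketched in software as follows. This is a sequential illustration only, not the parallel hardware; the function names and the use of a plain L1 distance as the default metric are assumptions made for the example.

```python
import heapq


def l1_distance(query, example):
    """Plain L1 distance; a stand-in metric chosen for this sketch."""
    return sum(abs(q - e) for q, e in zip(query, example))


def k_nearest(query, examples, k, distance=l1_distance):
    """Exhaustive k-nearest-neighbor search with a bounded priority queue.

    Returns a list of (distance, index) pairs, closest first.
    """
    if k <= 0:
        return []
    # heapq is a min-heap, so distances are negated: heap[0] then holds the
    # largest of the k distances kept so far, which is the one to evict.
    heap = []
    for index, example in enumerate(examples):
        d = distance(query, example)
        if len(heap) < k:
            heapq.heappush(heap, (-d, index))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, index))
    return sorted((-neg_d, index) for neg_d, index in heap)
```

In the architecture described here each distance would be computed by its own processor and the queue would be updated as distances arrive; the loop above serializes both steps.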

Custom hardware was built with a bit-serial design, allowing for a simple interface between memory, processors, and sorters, as well as high-speed operation. Each clock cycle, a single bit is read from the memory into each of the processors. With a large number of memory banks on chip, and clock speeds ranging from 100 MHz to 200 MHz, a high memory bandwidth can be achieved.

Figure 1: Block diagram of the generalized k-nearest neighbors architecture. Examples are stored in memory, and presented in parallel to distance calculation processors along with a common input query. The distances are computed and sorted in parallel.

4  Hardware  Implementation 
4.1  Specifications 

Target  design  specifications  for  the  hardware  have 
been  selected  to  be  appropriate  for  a  large  range  of 
potential  applications.  The  system  should  be  scal¬ 
able  to  store  a  large  number  of  examples,  in  the 
range  of  10®  to  10^  examples.  The  number  of  near¬ 
est  neighbors  that  can  be  retrieved  should  also  be 
scalable. As a compromise between flexibility and ease of hardware implementation, a programmable weighted L1-norm was chosen as the distance metric:

$$\mathrm{Distance} = \sum_{i=1}^{n} \alpha_i \, |x_i - y_i| \qquad (1)$$

The constants $\alpha_i$ are chosen to emphasize the importance of selected example features or dimensions. The example vectors are encoded using 16-bit signed integers for each dimension, and the weights as 8-bit unsigned integers. The processors hold 32-bit accumulated distance measurements. Given the example and weight bit resolutions, examples of up to 256 dimensions can be accommodated without overflow in the accumulated distance measure.

Examples  of  larger  dimensionality  can  be  accommo¬ 
dated  by  reducing  the  bit  precision  of  the  weights,  or 
by  assuming  that  examples  which  produce  overflow 
are  too  far  away  from  the  query  to  be  considered 
relevant,  and  can  be  ignored. 
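The overflow claim can be checked with a short worst-case calculation. The sketch below assumes an unsigned 32-bit accumulator and takes the largest possible term as (maximum weight) x (maximum absolute difference of two signed 16-bit values); the function names are illustrative and not part of the design.

```python
def worst_case_distance(n_dims, value_bits=16, weight_bits=8):
    """Largest weighted L1 distance for n_dims dimensions.

    Two signed value_bits integers differ by at most 2**value_bits - 1,
    and an unsigned weight_bits weight is at most 2**weight_bits - 1.
    """
    max_abs_diff = 2**value_bits - 1
    max_weight = 2**weight_bits - 1
    return n_dims * max_weight * max_abs_diff


def fits_accumulator(n_dims, acc_bits=32, value_bits=16, weight_bits=8):
    """True if the worst case fits an unsigned acc_bits accumulator (assumed)."""
    return worst_case_distance(n_dims, value_bits, weight_bits) < 2**acc_bits
```

Under these assumptions 256 dimensions fit in 32 bits, 512 dimensions do not, and dropping the weights to 7 bits makes 512 dimensions fit again, which matches the trade-off between dimensionality and weight precision described above.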
4.2  An  Embedded  Memory  System

The custom distance-calculation and sorting hardware was merged with commercial DRAM: specifically, a 16-Mbit DRAM organized into 64 blocks of 256 Kbits, incorporating 64 distance calculation processors and a 64-entry sorting array. Figure 2 shows a die photo of the chip. Each 256-Kbit block is interfaced with an L1-norm processor, which is in turn connected to a 64 x 1 array of sorters, acting as a single priority queue slot. The total die area is roughly 17 mm x 17 mm. Though the DRAM is designed in a 0.35 micron process, the unsuitability of the fabrication process for standard logic requires 1.0 micron design rules for the custom layout. Improvements in the process should allow for significant reductions in die area. The chip can operate at speeds between 100 and 200 MHz, as the DRAM data is slowly sensed in parallel and then fed to the processors serially at a high rate, ensuring a new bit per clock cycle for each processor. Running at 100 MHz, a chip of 64 processors would have a peak memory-to-processor bandwidth of 6.4 Gbits/second. Multiple chips can be used to scale processor speed, memory storage, and the number of nearest neighbors that can be retrieved.


Figure  2:  Die  photograph  of  embedded  DRAM 
chip. 

5  System  Issues 

5.1  Projected  Performance 

As an example of system performance, consider a database of 4,000 16 x 16 templates of written characters. One template requires 4 Kbits of storage (16 bits per dimension x 256 dimensions), so 4,000 of these templates require 16 Mbits of storage, which fits on a single embedded-DRAM chip. An HP-755 workstation takes about 180 ms to retrieve the 64 closest matches to an input query. At 100 MHz, a single bit-serial L1-norm processor would require only 160 ms to retrieve the 64 closest matches. (The L1-norm processor loads one bit per clock cycle, and will take 16 Mcycles to process all the examples. At 100 MHz, this takes 160 ms, if we ignore pipeline latency.) A single chip with 64 processors would require 2.56 ms to retrieve the 64 closest matches, and a system with 16 chips working together would require only 160 µs. Thus, speedup factors of greater than 10^3 are feasible using the special-purpose hardware. Other image data, such as color templates of faces or medical images, can be stored and searched in a similar manner. The I/O bandwidth between memory and processing on the special-purpose hardware can exceed 100 Gbits/second for a system of 16 chips containing 32 Mbytes of DRAM connected to 1024 processors operating in parallel.
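The projected figures above follow from simple bit-serial arithmetic, which the sketch below reproduces. It assumes one bit per processor per clock cycle, ideal scaling across processors, and no pipeline latency; the constant and function names are chosen for illustration.

```python
TEMPLATES = 4000
BITS_PER_TEMPLATE = 16 * 256   # 16 bits per dimension, 256 dimensions
CLOCK_HZ = 100e6               # 100 MHz, one bit per processor per cycle


def search_time_s(total_bits, processors, clock_hz=CLOCK_HZ):
    """Time to stream the whole database through the processors."""
    return total_bits / (processors * clock_hz)


def bandwidth_bits_per_s(processors, clock_hz=CLOCK_HZ):
    """Peak memory-to-processor bandwidth."""
    return processors * clock_hz


total_bits = TEMPLATES * BITS_PER_TEMPLATE      # 16,384,000 bits
one_processor = search_time_s(total_bits, 1)    # about 0.164 s
one_chip = search_time_s(total_bits, 64)        # about 2.56 ms
sixteen_chips = search_time_s(total_bits, 1024) # about 160 microseconds
speedup_vs_workstation = 0.180 / sixteen_chips  # about 1.1e3
```

The single-processor time comes out near 164 ms rather than exactly 160 ms because 4,000 templates of 4,096 bits slightly exceed 16 x 10^6 bits; the 2.56 ms and 160 µs figures match to the quoted precision.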
Abstract

We present a coarse-grained, reconfigurable coprocessor for compute- and I/O-intensive image processing tasks. While FPGA-based reconfigurable coprocessors utilize fine-grained, general-purpose, homogeneous functional blocks, we have implemented a coarse-grained, heterogeneous architecture with systolic processor arrays, programmable-length line buffers, and specialized processor blocks, which is significantly faster and more area-efficient. Using 0.6 µm CMOS technology, the 180,000-transistor image coprocessor was implemented in only 4.12 mm x 2.59 mm, performed 7 Gops/sec at 33 MHz, and consumed 600 mW at a 3.3 V power supply. The performance of the reconfigurable coprocessor is estimated for typical computer vision tasks such as block matching, database comparison, and template searching.

1  Introduction 

Machine vision tasks are characterized by extremely demanding computation and I/O bandwidth requirements, even for simple real-time visual inspection and analysis tasks. While typical audio processing tasks can be implemented in real time with DSP chips, the I/O bandwidth, memory capacity, and computation throughput required for image processing tasks are several orders of magnitude beyond the capabilities of even supercomputers. In particular, it has been estimated that at least several hundred teraflops would be required to approach the basic capabilities of human visual systems.


*This work was supported in part by an NSF Young Investigator Award (MIP-92-57964), a grant for High Performance Computing and Communication from NSF/DARPA (GC-R-334138), and a grant from the ONR Multidisciplinary University Research Initiative (CT-S-608521).


Although several orders of magnitude performance improvement over state-of-the-art microprocessors can be realized through the careful design and implementation of full-custom VLSI coprocessors, these special-purpose processors are each optimized to perform only a single task. In response, reconfigurable coprocessors have been proposed as a possible solution and have been implemented using FPGAs (Field Programmable Gate Arrays) with fine-grained, homogeneous, general-purpose functional blocks [Knittel et al., 1996]. In contrast, we present a novel coarse-grained, reconfigurable image coprocessor which can be efficiently programmed for high-speed execution of a range of image comparison tasks such as block matching, robust template correlation, and pyramidal image searching. This image coprocessor is called the G2 (generation two) here to distinguish it from an earlier design (the G1 [Gilbert and Yang, 1993]) which utilized a fixed systolic processor array.

Block matching (or image correlation¹) is computationally and I/O intensive and is encountered in several areas of computer vision and image processing, such as automated visual inspection systems, face recognition and identification systems, motion estimation, and video compression systems employing motion estimation (e.g. many MPEG and H.261 compliant systems). The block matching task can consume as much as three quarters of the total processing power in typical video codecs [Fujiwara et al., 1992]. This expense is greatest in the case of full search block matching, as compared to reduced search schemes, some of which are detailed in [Pirsch et al., 1995]. Full search block matching, however, has the desirable properties of relatively simple control and guaranteed correct results, and is an ideal candidate for VLSI implementation due to the inherent parallelism which can be exploited to compute block matching scores in real time. The core systolic processor array in the G2 is optimized for the full search block matching task and is utilized as a basic primitive function for performing a variety of other tasks.

¹This term is equivalent to block matching and will be used interchangeably.


2  Architecture 

The G2 processor was designed as a high-speed coprocessor for use over fast local buses. In particular, the 180,000-transistor design was fabricated in 0.6 µm CMOS technology on a 4.12 mm x 2.59 mm die, performed 7 Gop/s at 33 MHz (PCI bus speed), and consumed 600 mW at 3.3 V. Testing of this design at 66 MHz for use with next-generation PCI buses is currently being performed.

A block diagram of the G2 chip is shown in Figure 1. This is a dataflow-driven architecture with image data typically input as a serial raster scan. Control and data are entered byte-wise on common input pins. An integrated programmable control unit detects the control bytes and reconfigures the processor architecture by setting the variable-length line buffers in the systolic array, multiplexing the image decimation/interpolation processors, initializing the internal control sequencing, and optionally enabling an automatic comparison and minimum detect across the 64 processing elements (PEs). Following reconfiguration, image data (8-bit grey-level pixels) is input until the image correlation operation is complete; no further instructions or external control are required. A database image (reference block) is correlated against a target image (search area). Nominally, an 8x8 PE array on a single chip can support up to a -3/+4 search area. Maximum image width is set by the length of the line buffers at 128 pixels at full resolution (no decimation). However, since the variable-length line buffers are individually programmable, a variety of other search ranges, search areas, and image sizes can be easily configured.

Each processing element implements a mean absolute difference (MAD) computation, consisting of an absolute difference (subtract, sign change/retain) and accumulation onto the value previously stored in the PE. The subtractor is 8 bits wide; the accumulator is a 24-bit saturating adder which permits accumulated differences of up to 23 bits (i.e. 128 x 256 images) to be handled at full accuracy. Much larger images can be handled if distinguishing between MAD values which exceed 23 bits is not needed (which is the case for most applications). The G2 chip also includes two processors for bilinear image interpolation (1 to 2 x 2) or decimation (2 x 2 to 1). In addition, a MAD minimum detection processor can be utilized to find both the address location and the minimum MAD value amongst the 64 PEs. Clocking and control signals are generated by an integrated programmable finite state machine.

Figure 1: G2 Chip Architecture

Figure 2: Coprocessor Board with 4 G2 Correlator Chips
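The processing-element computation and the minimum detection described above can be illustrated with a small software model of full search block matching. This is a sketch under stated assumptions, not the chip's implementation: it accumulates absolute differences without normalizing to a mean, models the 24-bit saturating accumulator with a simple clamp, and assumes the target is exactly 7 pixels larger than the reference in each dimension so that the 64 displacements of a -3/+4 search are all valid. All names are illustrative.

```python
ACC_MAX = 2**24 - 1  # ceiling of a 24-bit saturating accumulator


def accumulate_block(reference, target, dy, dx, lo=-3):
    """Accumulated absolute difference for one displacement (one PE)."""
    acc = 0
    for r, row in enumerate(reference):
        for c, pixel in enumerate(row):
            diff = abs(pixel - target[r + dy - lo][c + dx - lo])
            acc = min(acc + diff, ACC_MAX)  # saturate instead of wrapping
    return acc


def full_search(reference, target, lo=-3, hi=4):
    """Return (minimum score, (dy, dx)) over all displacements in [lo, hi]."""
    scores = {
        (dy, dx): accumulate_block(reference, target, dy, dx, lo)
        for dy in range(lo, hi + 1)
        for dx in range(lo, hi + 1)
    }
    best = min(scores, key=scores.get)
    return scores[best], best
```

With the default range the dictionary holds 8 x 8 = 64 scores, one per processing element, and the final `min` plays the role of the minimum detection processor.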

In order to maximize the utilization of the PCI bus bandwidth, four G2 chips can be directly interfaced to the 32-bit PCI bus and to four local SRAM caches for the target images, as shown in Figure 2. In the simplest configuration, all of the address lines to the four SRAM caches and four G2 chips are shared in common. Thus, the four G2 chips can be used directly in parallel, with common or distinct target images, to search against a large database of images. There are several other ways of programming the chip to enhance image search area, image size, etc. More details are given in [Bugeja and Yang, 1997]. In general, a large number of these chips can be combined in parallel/cascade (cascade outputs are included at the end of the dataflow chain for this purpose) to realize very powerful configurations.
3  Applications  and  Benchmarks

Block matching is used in several different contexts; the G2 chip implements block matching of large templates over moderate search ranges (-3/+4 for a single chip operating in normal mode). This makes it ideally suited for applications involving image recognition. For example, [Gilbert and Yang, 1993] describes a facial recognition system which uses the G1 chip as the central element in a system capable of recognizing a face in front of a camera by comparing against a large database of images; block matching is required to compensate for image misalignment and hence achieve robust template registration. The G1 system achieved correlation speeds 4 times faster than a 150 MHz Pentium running optimized assembly code, and could process 500 database images in 1 second over a search area of -2/+2; identification rates of 88% were measured using cross-validation. A four-chip G2 system such as that shown in Fig. 2, on the other hand, implements a search area of -3/+4 for even better identification performance, and can process 8000 images in 1 second on a PCI board running at 33 MHz, thus achieving a 64x speedup over the 150 MHz Pentium.

The G2 may also be used in other applications such as template search. The comparison of a large number of templates against a large search area is essential to many machine vision tasks but requires substantial processing time and computing resources. In a typical practical example, the templates might be fiducial marks of some sort, while the search area might be an image of a printed circuit board within which these fiducials are to be located for registration and detailed inspection. Unlike recognition applications, which are characterized by the correlation of database images against target images of only slightly larger size (e.g. 128x128 vs 135x135), search applications are typically characterized by target images that are large compared to the templates to be located within them.

We describe the application of the G2 to an example task involving the location of a 128x128 template within a 512x512 target. The scheme employs a two-level pyramid search whose first step is the search of a 32x32 template within a 128x128 version of the target (decimation by four). The decimated 128x128 target is broken down into 256 overlapping 39x39 blocks, and the 32x32 template is correlated against each of these blocks (256 correlations, taking 256 x 45 µs on a single G2). Typically around 20 good matches are returned; 64x64 templates are then compared against a 256x256 target (decimation by 2), i.e. 20 64x64 correlations (20 x 151 µs). Finally, assuming around 4 good locations are returned, 4 correlations at full 128x128 resolution are performed to obtain the final best match (4 x 550 µs). A four-chip G2 system can execute this algorithm in 4.2 ms.
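The 4.2 ms figure is consistent with summing the three stage times quoted above and dividing by the number of chips. The sketch below makes that arithmetic explicit; the assumption that four chips give an ideal fourfold speedup at every stage is ours, and the names are illustrative.

```python
# (number of correlations, time per correlation in microseconds) per stage
STAGES_US = [
    (256, 45),   # 32x32 template on 39x39 blocks, decimation by 4
    (20, 151),   # 64x64 correlations, decimation by 2
    (4, 550),    # 128x128 correlations at full resolution
]


def single_chip_time_us(stages):
    """Total search time on one G2, in microseconds."""
    return sum(count * per_item_us for count, per_item_us in stages)


def multi_chip_time_ms(stages, chips):
    """Total time in milliseconds, assuming ideal scaling across chips."""
    return single_chip_time_us(stages) / chips / 1000.0
```

One chip needs 11,520 + 3,020 + 2,200 = 16,740 µs, so four chips under ideal scaling need about 4.19 ms, which rounds to the quoted 4.2 ms.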

4  Conclusions 

A coarse-grained reconfigurable coprocessor for compute- and I/O-intensive image processing tasks has been presented. The chip is designed around a core capable of carrying out block-matching type algorithms with high efficiency as a basic function; many other tasks can then be handled by appropriate programming of the core and other components available on chip. Typical applications, and the performance which systems constructed using the chip can achieve, have been briefly summarized.
Abstract

We develop a new general formalism for dealing with physics-based processes of image formation. Particular attention is given to the shape-from-shading problem, but other sensors, such as infrared, can also be treated. Unlike previous shape-from-shading methods, our formalism is three-dimensional and treats depth on an equal footing with the image coordinates. This makes it possible to treat the problem in perspective projection, rather than the usual orthographic projection. The formalism does not make specific assumptions about the surface, but makes it easy to incorporate a model-based assumption in the recovery of the shape. For example, we can assume that the shape is locally a quadric surface. This enables us to recover a surface without reliance on maximal-brightness points or on occluding boundaries, with their serious instability problems. The formalism is based on an adaptation of variational calculus, including Hamilton’s equations and their invariance properties. In particular, scale-invariance is shown to be a useful property of the shading process.

1  Introduction 

The shape-from-shading problem has usually been treated in a formalism introduced by Horn (e.g. [5]). This formalism assumes that there is a certain direction in space, the “viewer direction”, and that shapes are orthographically projected onto an image in this direction.

The author is grateful for the support of the Air Force Office of Scientific Research under Grant F49620-96-1-0355, and of the Office of Naval Research under Grant N00014-95-1-0521.

There are several problems with this formalism. First, the orthographic projection is only an approximation. The resulting error is more than purely geometric: it also distorts the grey level values at each point. The grey values depend on the angle between the surface normal and the direction of the emitted light. If this direction is assumed to be the same everywhere when in reality it is not, then the grey values will be distorted. Attempts at providing a non-orthographic treatment have been made (e.g. [6]), but they involved unnatural (“stereographic”) projections and singularities, and had problems with integrability. Here we provide a full 3D perspective treatment of the shape-from-shading and similar problems. Our viewer is “isotropic”: there is no particular viewer direction.

Another problem is that the usual formalism makes it hard to reconstruct the surface without knowledge of some special points or curves, for example points of maximal brightness [4] or occluding contours [6]. However, such extremal points and curves are usually very hard to measure accurately. The image in a neighborhood around a maximal-brightness point is usually quite saturated, making the grey levels very unreliable. The grey levels near an occluding contour are also hard to measure accurately.

If we want to recover a surface from a patch of grey level values, without resort to extremal quantities, we need some additional knowledge about the surface. Several simple modeling assumptions have been tried. Some examples are the assumption of a locally spherical surface [7, 10], a locally cylindrical surface [19], or a surface with umbilical points everywhere [2]. However, these assumptions are either too restrictive, i.e. they make too strong an assumption about the surface, or not restrictive enough, thus not allowing us to find the surface uniquely. It is hard to find a modeling assumption which is “just right” in this sense, or even to determine what is just right, namely how much information we need to add to the given shading information in order to recover the surface.

Our new formalism makes it easier to understand how much information is needed in addition to the shading, and to add it in a natural way. Because of the 3D structure of the formalism, we can add a modeling assumption that is most naturally expressed in a 3D representation. We do not have a preferred viewer direction or a preferred z axis; thus we can say things about the surface which are independent of such preferences. In one example that we use here, the surface is assumed to be locally quadric. This is equivalent to a smoothing assumption on the surface. With this assumption we are able to recover a non-Lambertian surface, in perspective projection, without the use of extremal points or curves. We do assume that the reflectance function is given, but there are no restrictions on this function. The method is non-iterative and there are no convergence problems.

Our new formalism is based on concepts from mathematical physics, mainly Hamilton’s equations and their invariants. In [16, 17] we introduced the application of physics-like invariants to physical imaging processes. Here we make use of these concepts and extend the Hamiltonian formalism to 3D. The Hamiltonian formalism was also used for shading independently in [4] and [11], but without the use of invariance.

Invariance is important in vision in two aspects: the geometrical aspect and the physical one. While geometric invariants have been studied extensively lately (e.g. [9, 13, 14, 15, 18]), the physical ones still need to be fully exploited. Physics-like invariance is the analog of the conservation of energy, momentum, etc., which are invariants of the laws of nature. The analogy can be very useful in vision.

2  Shading — The  Physics  Analogy 

We  use  the  shape-from-shading  problem  as  the 
most  difficult  example  that  demonstrates  the  use  of 
physics-like  invariant  methods.  Other  processes  can 
be  treated  more  simply. 

In the shape-from-shading problem, we have data in the form of the image brightness E(x, y), from which a surface z(x, y) has to be recovered. This recovery depends on many unknowns, such as the surface function itself, the surface reflectance, the lighting, etc. As a way of simplifying the problem, the “reflectance map” was introduced by Horn, e.g. [5]. This is a function R(p, q) that represents the amount of light emitted in the viewer (z) direction from a surface element with slope components p, q, in an orthographic projection. It contains only the two unknowns p, q, which represent the slopes of the surface in the x and y directions. It is assumed that this function is known, i.e. that the light distribution and other factors are already built into R, and the only unknowns are the x and y slopes of the object at each point x, y. The light emitted from the surface element in the z direction falls on the image at point x, y, and gives rise to the image brightness E(x, y) at that point. To a good approximation, this image brightness is proportional to the amount of light R(p, q) that was emitted. Thus we can write the “image irradiance equation”

$$E(x, y) = R(p, q) \qquad (1)$$

with the proportionality constant absorbed into E. This deceptively simple-looking equation is in fact a complicated partial differential equation, because p, q are the derivatives of the surface height z(x, y) with respect to x, y: $p = \partial z/\partial x$, $q = \partial z/\partial y$. Our goal is to solve this equation for the surface function z(x, y), given the brightness E(x, y).

The irradiance equation can be solved by the method of characteristics, which is quite commonly used for first-order partial differential equations. We will briefly describe this method. In ordinary differential equations, one can start from a known initial point with given initial conditions, and propagate the solution from that point in infinitesimal steps, using the given equation. In partial differential equations, we need an initial curve rather than a point. From each point on this initial curve we grow a solution curve, the “characteristic” x(t), y(t), z(t), going in steps of some parameter t (Figure 1). This characteristic is a solution of the equation along one curve. The collection of all characteristics, started from all points on the initial curve, constitutes the solution on the whole domain.

In most cases (including ours) we need the tangents to the characteristic curves as part of the solution, and thus the quantities p(t), q(t) are added. The term “characteristic strips” rather than curves is often used for this situation.

In  order  to  grow  the  characteristics  for  the  irradiance 
equation  (1),  it  has  to  be  brought  into  a  suitable 
form.  This  is  done  in  [5].  We  will  later  derive  a 
generalized  version.  One  obtains  the  four  first-order 
coupled  differential  equations 

$$\dot{x} = R_p, \qquad \dot{y} = R_q, \qquad \dot{p} = E_x, \qquad \dot{q} = E_y \qquad (2)$$

where the dot means differentiation with respect to the curve parameter t and the subscripts indicate partial differentiation with respect to the subscripted variable.
Figure  1:  Method  of  characteristics


To grow a characteristic we proceed as follows. We start from a point $x_0, y_0$ on the given initial curve, with known slopes $p_0, q_0$. We substitute these values in the rhs of the above four equations, thus obtaining the values of the four derivatives $\dot{x}, \dot{y}, \dot{p}, \dot{q}$. We can then advance in a step of a small size $\Delta t$ to a new point having coordinates and slopes $x_1 = x_0 + \dot{x}\,\Delta t$, etc. From this point we can iterate the process to form a whole characteristic curve. It remains to find z(t), which can be done by integrating the equation

$$\dot{z} = p R_p + q R_q \qquad (3)$$

in  which  the  rhs  is  known. 
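The stepping procedure just described can be sketched as a forward-Euler loop. This is an illustration under stated assumptions rather than the author's implementation: the caller supplies the partial derivatives of R and E as functions (hypothetical names `R_p`, `R_q`, `E_x`, `E_y`), a fixed step `dt` is used, and no attempt is made to control the integration error.

```python
def grow_characteristic(x0, y0, z0, p0, q0, R_p, R_q, E_x, E_y, dt, n_steps):
    """Grow one characteristic strip by forward-Euler steps.

    R_p, R_q take (p, q); E_x, E_y take (x, y). Returns the list of
    (x, y, z, p, q) states, starting with the initial one.
    """
    x, y, z, p, q = x0, y0, z0, p0, q0
    strip = [(x, y, z, p, q)]
    for _ in range(n_steps):
        rp, rq = R_p(p, q), R_q(p, q)
        # Equations (2): x' = R_p, y' = R_q, p' = E_x, q' = E_y
        dx, dy, dp, dq = rp, rq, E_x(x, y), E_y(x, y)
        # Equation (3): z' = p R_p + q R_q
        dz = p * rp + q * rq
        x, y, z = x + dx * dt, y + dy * dt, z + dz * dt
        p, q = p + dp * dt, q + dq * dt
        strip.append((x, y, z, p, q))
    return strip
```

As a usage example with toy functions chosen only for checking the arithmetic, taking R = (p² + q²)/2 and E = (x² + y²)/2 gives `R_p = lambda p, q: p`, `E_x = lambda x, y: x`, and so on; one step of size 0.1 from (x, y, z, p, q) = (1, 0, 0, 0, 1) moves y, z and p to 0.1 and leaves x and q unchanged.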

The problem remains of finding the initial conditions. We can define an arbitrary initial curve r(x, y), but we also need the values of p, q at each point on it. Thus we need some additional constraints, and there is no general way of obtaining them.

Common ways to deal with this problem are: i) Using points of maximal brightness (e.g. [4]). These are points where characteristics converge, so one needs initial conditions only at these points. However, the shading information around these points is very unstable, leading to unreliable characteristics [12]. ii) Using occluding boundaries (e.g. [6]). Here we know that the surface is tangent to the viewer’s line of sight, and that fact can be used as initial conditions. However, the shading information around an occluding boundary is also unstable, and the boundary can itself be occluded.

A way to overcome these problems is to use model-based knowledge about the surface in the process of recovering the surface. Our formalism makes it easy to do that because it treats the surface in a full 3D description rather than separating the depth information. For example, approximating the surface as locally quadric becomes quite natural using the ordinary 3D representation of a quadric. In addition, invariance properties such as scale or rotational invariance are a useful part of the formalism.

For our subsequent treatment we now recast the four equations (2) in a different form which is more easily handled by the calculus of variations. We define the Hamiltonian function H as

$$H = R - E$$

In this formulation, the irradiance equation (1) is H = 0, while the characteristic equations (2) can now be written in the form of Hamilton’s equations

$$\dot{x} = \frac{\partial H}{\partial p}, \qquad \dot{y} = \frac{\partial H}{\partial q} \qquad (4)$$

$$\dot{p} = -\frac{\partial H}{\partial x}, \qquad \dot{q} = -\frac{\partial H}{\partial y} \qquad (5)$$

This form is similar to the one used to describe physical processes, and makes it possible to use the available knowledge in mathematical physics about invariants of such processes. It should be noted, however, that the similarity is only an analogy here, since our p, q are purely geometrical entities (slopes) while in physics they represent momentum. Our problem is in fact more general than the physics problem because there the kinetic energy is limited to the simple form $p^2/2m$.

Continuing  the  physics  analogy,  the  characteristics 
can  be  thought  of  as  the  trajectories  of  particles  or 
fluid elements moving in time $t$ with velocity $\dot{x}$ and
acceleration $\dot{p}$. $H$ represents the particle’s “energy”,
comprised of kinetic energy $R$ and potential energy
$-E$. The equation $H = 0$ above means that the total
energy is constant along the trajectory. Looking
at  Hamilton’s  equation  above,  we  can  see  that  an  ar¬ 
bitrary  constant  can  be  added  to  the  energy  on  each 
trajectory,  since  it  will  vanish  in  the  differentiation 
in  the  rhs. 
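
The energy argument above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it integrates Hamilton's equations (4), (5) for a separable $H = R(p,q) - E(x,y)$ with a fourth-order Runge-Kutta step and measures how much $H$ drifts along one characteristic. The particular `R` and `E`, the starting state, the step size, and all function names are assumptions made for illustration. The starting state is not chosen to satisfy $H = 0$; the check only uses the fact that $H$ stays constant along a trajectory.

```python
import math

def R(p, q):
    # Illustrative reflectance map (assumed): Lambertian surface lit from overhead.
    return 1.0 / math.sqrt(1.0 + p * p + q * q)

def E(x, y):
    # Illustrative smooth image brightness (assumed).
    return 0.8 - 0.1 * (x * x + y * y)

def H(state):
    x, y, p, q = state
    return R(p, q) - E(x, y)

def grad_H(state, h=1e-5):
    # Central-difference partial derivatives of H with respect to x, y, p, q.
    g = []
    for i in range(4):
        up = list(state); up[i] += h
        dn = list(state); dn[i] -= h
        g.append((H(up) - H(dn)) / (2.0 * h))
    return g

def hamilton_rhs(state):
    # Hamilton's equations: x' = dH/dp, y' = dH/dq, p' = -dH/dx, q' = -dH/dy.
    Hx, Hy, Hp, Hq = grad_H(state)
    return [Hp, Hq, -Hx, -Hy]

def rk4_step(state, dt):
    def shifted(s, k, a):
        return [si + a * ki for si, ki in zip(s, k)]
    k1 = hamilton_rhs(state)
    k2 = hamilton_rhs(shifted(state, k1, dt / 2))
    k3 = hamilton_rhs(shifted(state, k2, dt / 2))
    k4 = hamilton_rhs(shifted(state, k3, dt))
    return [s + dt / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def trace_characteristic(state, dt=0.01, steps=200):
    # Follow one characteristic and keep every intermediate state.
    path = [state]
    for _ in range(steps):
        state = rk4_step(state, dt)
        path.append(state)
    return path

start = [0.3, 0.2, 0.4, -0.1]          # (x, y, p, q), arbitrary
path = trace_characteristic(start)
drift = max(abs(H(s) - H(start)) for s in path)
```

Here `drift` is the largest deviation of $H$ from its initial value along the traced characteristic; for a smooth `R` and `E` it should stay near the integration error.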

Besides finding invariants, another advantage of the
Hamiltonian  representation  is  that  it  enables  us  to 
deal  with  a  more  general  irradiance  equation  than 
the  one  given  in  (1).  This  will  enable  us  to  deal 
with  perspective  projection  and  with  non-shading 
physics-like  processes.  We  have  generalized  the 
treatment  in  two  ways  [16,  17]: 

(i) There is no need to separate the variables between
two different functions $R(p,q)$, $E(x,y)$.
Rather, we have one function $H(x,y,p,q)$ that
depends on all four variables in a general way.
In our shading example we will use a reflectance
function $R$ that depends on $x, y$ as well as $p, q$.


Table 1: Relation between symmetries and invariants.

Symmetry under...        Conservation law
Translation in time      energy
Translation in space     momentum
Rotation                 angular momentum
Lorentz (space/time)     E = mc^2
Projective
Scale                    scale momentum (?)

(ii)  The  variables  p,q  do  not  have  to  be  slopes. 
They  can  be  any  variables  for  which  the  Hamil¬ 
ton  equations  make  sense,  and  there  can  be 
many  physics  and  physics-like  processes  that 
satisfy  this  equation.  In  the  shading  exam¬ 
ple  they  can  be  angles  or  3D  vectors.  This  is 
valuable  because  the  reflectance  laws  are  sim¬ 
pler  (and  more  evidently  symmetric)  in  terms 
of  such  variables. 

3  Invariants  and  Symmetries 

In  this  section  we  summarize  the  relation  between 
symmetries  of  the  physical  laws  and  invariants,  as  it 
applies  to  shading  (or  shading-like)  processes.  Our 
full  development  is  given  in  [16,  17]. 

In  the  classical  example,  energy  and  momentum  are 
invariants  of  the  equations  of  motion,  namely  they 
are  constant  as  the  particle  moves  along  a  trajec¬ 
tory.  Energy  conservation  results  from  symmetry  of 
the  equations  of  motion  with  respect  to  translation 
in  time,  while  momentum  conservation  results  from 
symmetry  of  the  motion  equations  with  respect  to 
translation  in  space.  Similarly,  conservation  of  angu¬ 
lar  momentum  results  from  a  rotational  symmetry  of 
the  equations.  By  symmetry  we  mean  that  the  form 
of  the  equation  is  invariant  under  the  transforma¬ 
tion.  The  solution  is  not  generally  symmetric  and 
depends  on  the  initial  conditions.  In  other  words, 
symmetry  and  invariance  here  are  properties  of  the 
basic  physical  laws  and  not  of  any  particular  situa¬ 
tion.  Table  1  summarizes  various  symmetries  used 
in  physics  and  their  corresponding  invariants. 

The  main  result  concerning  the  relation  between  in¬ 
variants  and  symmetries  is  Noether’s  theorem  [8]. 
We  give  here  only  the  result  as  it  applies  to  the  shad¬ 
ing  problem. 

We deal with a general transformation of the coordinates
$x^i$ and $t$, having $r$ parameters $w^s$, $s = 1, \ldots, r$.
For example, the rotation group in the plane has one
parameter, the angle $\theta$. This coordinate transformation
can be written as

$$x'^i = x^i + c^i_s\, dw^s, \qquad t' = t + b_s\, dw^s \tag{6}$$

with $c^i_s$, $b_s$ being coefficients characterizing the
transformation. They generally depend on $x^i$, $t$. The
summation convention is used. (In the 2D case $x^i$ is
$x, y$.)

If the reflectance function $R$ is symmetric with
respect to a transformation of the above type,
Noether’s theorem leads to the following “conservation
law”, as we showed in [16, 17]:

$$\frac{d}{dt}\left(p_i c^i_s\right) = \frac{dE}{dw^s}$$

The rhs is a known quantity. In the physical analogy,
the quantity $p_i c^i_s$ in the lhs is the invariant such
as energy or momentum, while the rhs is related to
an external force. In the momentum case the above
equation means that the change in momentum over
time is proportional to the external force. Momentum
is conserved in the absence of an external force.
The rhs can be written more explicitly in terms of
the coordinates. For a general function of $x^i$ we can
write, using the chain rule,

$$\frac{d}{dw^s} = c^i_s \frac{\partial}{\partial x^i} \tag{7}$$

Thus the conservation law becomes

$$\frac{d}{dt}\left(p_i c^i_s\right) = c^i_s \frac{\partial E}{\partial x^i} \tag{8}$$

An  interesting  property  here  is  that  we  do  not  need 
to  know  the  exact  details  of  R  to  find  the  forms 
of  various  invariants,  as  R  does  not  appear  in  the 
invariant  equations  above.  The  symmetry  properties 
of  R  are  sufficient. 

It  is  clear  that  it  is  desirable  to  find  transforma¬ 
tions  under  which  the  reflectance  function  R,  or  the 
physical  law  that  it  describes,  is  symmetric.  The 
coefficients $c^i_s$ of this transformation will then be
substituted into the conservation law (8) to find
the  invariant  constraints.  In  [16,  17]  we  have  done 
this  for  rotation  and  translation.  Here  we  do  it  for 
the  scaling  transformation,  which  is  the  most  im¬ 
portant  invariance  in  the  shading  case,  because  the 
reflectance  function  depends  only  on  angles  and  not 
on  scale.  It  has  no  physical  analog  because  physical 
laws  are  not  normally  scale-invariant. 
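
As a hedged illustration of how such a conservation law can be checked, the sketch below takes the rotation case in the image plane, with generator $c = (-y, x)$ and candidate invariant $p_i c^i = xq - yp$. It assumes the law has the form $\frac{d}{dt}(p_i c^i) = c^i\, \partial E / \partial x^i$ and uses the characteristic equations $\dot{x} = R_p$, $\dot{y} = R_q$, $\dot{p} = E_x$, $\dot{q} = E_y$. The functions `R` and `E` and the sample point are assumptions chosen only so that `R` is rotationally symmetric in $(p, q)$ while `E` is not symmetric at all.

```python
import math

def R(p, q):
    # Assumed reflectance map that depends only on p^2 + q^2, so it is
    # symmetric under rotations in the (p, q) plane.
    return 1.0 / math.sqrt(1.0 + p * p + q * q)

def E(x, y):
    # Assumed image brightness with no particular symmetry.
    return 0.5 + 0.2 * x - 0.1 * x * y + 0.05 * y * y

def partial(f, args, i, h=1e-6):
    # Central-difference partial derivative of f with respect to args[i].
    up = list(args); up[i] += h
    dn = list(args); dn[i] -= h
    return (f(*up) - f(*dn)) / (2.0 * h)

def rotation_law_sides(x, y, p, q):
    # Characteristic equations: x' = R_p, y' = R_q, p' = E_x, q' = E_y.
    xdot, ydot = partial(R, (p, q), 0), partial(R, (p, q), 1)
    Ex, Ey = partial(E, (x, y), 0), partial(E, (x, y), 1)
    pdot, qdot = Ex, Ey
    # Left side: d/dt (x*q - y*p), expanded with the product rule.
    lhs = xdot * q + x * qdot - ydot * p - y * pdot
    # Right side: c^i dE/dx^i with c = (-y, x).
    rhs = -y * Ex + x * Ey
    return lhs, rhs

lhs, rhs = rotation_law_sides(0.7, -0.4, 0.3, 0.9)
```

The two sides agree because the terms $\dot{x} q - \dot{y} p$ cancel for any `R` that depends only on $p^2 + q^2$; the details of `R` beyond that symmetry do not matter.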

Scale  invariance 

The scaling transformation $x'^i = w x^i$ can be written,
with a scaling parameter $w$, as

$$x'^i = x^i (1 + dw)$$

so that

$$c^i = x^i$$

A scale change can be written, from (7), as

$$\frac{d}{dw} = x^i \frac{\partial}{\partial x^i} \tag{9}$$

Thus the conservation law (8) becomes

$$\frac{d}{dt}\left(p_i x^i\right) = x^i \frac{\partial E}{\partial x^i} \tag{10}$$

The rhs represents the scaling transformation of $E$,
due to (9). We will later use a formulation in which
$E$ is scale-invariant and thus the rhs will vanish.
Thus, the quantity $p_i x^i$ can perhaps be called the
“scale momentum” or “magnification momentum”.
We will later expand on the use and geometrical
significance of the above equations.

4 Non-Lambertian Surface in Perspective Projection

4.1 The isotropic viewer

Our formalism makes it possible to deal with shading
in a perspective projection rather than the usual
orthographic one. In fact, for a non-Lambertian surface,
the treatment is easier and more natural in
perspective projection.

In perspective projection, we can write the Hamiltonian
$R - E$ in a way which is independent of the
viewing direction, or the optical axis of the camera.
This can be called an “isotropic” viewer. This is
possible and desirable since the image brightness is
basically a function of angles between the light rays
and the surface normal, not of the optical axis. The
physical construction of the camera does not need
to be isotropic, i.e. the image sensor can remain
flat. However $R - E$ can be represented as a function
of angles which are independent of the optical
axis. There is no “viewer’s direction” here. This is
unlike the usual orthographic projection which has
a preferred direction.

Thus we can write the brightness, using the 3D vector
$\mathbf{x} = (x, y, z)$, as

$$E\left(\frac{\mathbf{x}}{|\mathbf{x}|}\right) = E(\hat{\mathbf{x}})$$

i.e. the brightness at any point on the image depends
on a 3D unit vector $\hat{\mathbf{x}}$ pointing from the origin
(chosen as the camera optical center) towards
the image. In this formulation, there is no need to
distinguish between “world coordinates” and “camera
coordinates” as in most other formulations. $\hat{\mathbf{x}}$ is
the same on the image and on the object surface, for
points connected by the same ray of light (Figure 2).
The usual function $E(x, y)$ can easily be remapped
into a function $E(\hat{\mathbf{x}})$. Here the choice of the z-axis
direction is arbitrary and does not have to coincide
with the camera axis (as long as it passes through
the origin). Among other advantages, this can simplify
the handling of singularities near the limb of
the surface.

The surface will also be represented in a more
isotropic way. Instead of $z(x, y)$ we describe the surface
as an implicit function $f(x, y, z) = 0$. At each
point of the surface we can define a vector perpendicular
to it by

$$\mathbf{P} = \frac{\partial f}{\partial \mathbf{x}} = (P, Q, U) \tag{11}$$

The perpendicularity can be seen by writing

$$df = \mathbf{P} \cdot d\mathbf{x} = 0. \tag{12}$$

When $\mathbf{x}$, $d\mathbf{x}$ are on the surface then $df = 0$ and $\mathbf{P}$,
$d\mathbf{x}$ are perpendicular.

From this it is easy to calculate the surface gradient

$$\frac{\partial z}{\partial x} = -\frac{P}{U} = p, \qquad \frac{\partial z}{\partial y} = -\frac{Q}{U} = q$$

A unit normal to the surface can now be defined as

$$\mathbf{n} = \frac{\mathbf{P}}{|\mathbf{P}|} = \frac{(P, Q, U)}{\sqrt{P^2 + Q^2 + U^2}} = \frac{(-p, -q, 1)}{\sqrt{p^2 + q^2 + 1}}$$

$f$, and thus $\mathbf{P}$, are determined here only up to a
multiplicative factor $g$ (not necessarily constant). This
factor is eliminated from the gradient and normal
expressions above. The characteristic equations will
determine $g$ up to the initial conditions.
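
A small sketch can make the role of the factor $g$ concrete. It assumes, purely for illustration, the unit sphere written as $f = g\,(x^2 + y^2 + z^2 - 1) = 0$ with a constant $g$, and checks that the slopes $p = -P/U$, $q = -Q/U$ and the unit normal do not depend on $g$. The function names and the sample point are hypothetical.

```python
import math

def grad_f(x, y, z, g=1.0):
    # Gradient P = (P, Q, U) of the assumed implicit surface
    # f = g * (x^2 + y^2 + z^2 - 1) = 0, evaluated on the surface.
    return (2.0 * g * x, 2.0 * g * y, 2.0 * g * z)

def slopes(P):
    # Surface gradient of z(x, y): p = -P/U, q = -Q/U.
    Px, Qy, U = P
    return (-Px / U, -Qy / U)

def unit_normal(P):
    # n = P / |P|; a positive factor g cancels in this ratio.
    norm = math.sqrt(sum(c * c for c in P))
    return tuple(c / norm for c in P)

point = (0.36, 0.48, 0.8)          # lies on the unit sphere
P1 = grad_f(*point, g=1.0)
P2 = grad_f(*point, g=3.5)         # same surface, different factor g
```

For the upper hemisphere $z = \sqrt{1 - x^2 - y^2}$ the slopes are $z_x = -x/z$ and $z_y = -y/z$, which is what `slopes` returns for either value of `g`.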

The Hamilton equations (4), (5) with our generalized
treatment (end of Section 2) are easily extended to
the 3D isotropic case. We now write

$$\dot{\mathbf{x}} = \frac{\partial H}{\partial \mathbf{P}}, \qquad \dot{\mathbf{P}} = -\frac{\partial H}{\partial \mathbf{x}} \tag{14}$$

It is easily proved that equation (3) for $z$, which was
previously an extra equation, now comes out naturally
from the 3D Hamilton equations. This is true
for any irradiance function that depends on the normal
$\mathbf{n}$ rather than on $\mathbf{P}$. For any function $R(\mathbf{n}, \cdots)$
we can write, from (14),

$$\dot{\mathbf{x}} = \frac{\partial R}{\partial \mathbf{P}} = \frac{1}{|\mathbf{P}|}\left[\frac{\partial R}{\partial \mathbf{n}} - \left(\mathbf{n} \cdot \frac{\partial R}{\partial \mathbf{n}}\right)\mathbf{n}\right]$$

as can be verified componentwise. Multiplying by $\mathbf{P}$
we obtain

$$\mathbf{P} \cdot \dot{\mathbf{x}} = 0 \tag{15}$$

The geometrical meaning of this equation is that the
direction of a characteristic in 3D is perpendicular
to the surface normal. In other words, the characteristic
stays on the surface. Dividing $\mathbf{P} \cdot \dot{\mathbf{x}} = 0$ by $U$
we obtain (3), which was previously derived geometrically
and now follows from Hamilton’s equations.

The  “physical”  explanation  is  that  because  the  “mo¬ 
mentum”  component  perpendicular  to  the  surface 
always  equals  1  inside  R,  R  cannot  express  changes 
of  momentum  P  which  are  perpendicular  to  the  sur¬ 
face;  only  changes  tangent  to  the  surface  are  mean¬ 
ingful.  Since  momentum  change  is  proportional  to 
force,  the  “force”  driving  the  “particles”  along  char¬ 
acteristics  is  always  tangent  to  the  surface. 

Eq. (15) has an important relation to the integrability
of the system. From (12) we can see that (15)
is equivalent to $df/dt = 0$. Thus (15) is a necessary
condition for integrability to a surface, i.e. for
obtaining a surface function $f(x, y, z) = 0$.

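Another necessary condition is that an equation similar
to (15) holds perpendicular to the characteristic,
while (15) holds along it. This can be written, with
a parameter $s$ going across a characteristic, as

$$\mathbf{P} \cdot \frac{\partial \mathbf{x}}{\partial s} = 0 \tag{16}$$

This equation is independent of the Hamilton equations.
We will use it to close a system of equations
for moving across characteristics.

4.2 Generalized reflectance function

With the above formalism it is easy to treat a general
surface under perspective projection. For simplicity,
we carry out the treatment for the case in which the
light source is rotationally symmetric around the
direction $\mathbf{s}$. However, this restriction is immaterial to
the development and we will later show how to remove
it. Another simplification to be removed later
is the assumption of constant albedo.

The Hamiltonian can be written as

$$H = R(\mathbf{n} \cdot \mathbf{s},\, \mathbf{n} \cdot \hat{\mathbf{x}}) - E(\hat{\mathbf{x}}) = 0 \tag{17}$$

The term $\mathbf{n} \cdot \hat{\mathbf{x}}$ in $R$ represents the light emitted from
the surface. It means that the emitted light intensity
is a function of the angle between the surface normal
$\mathbf{n}$ and the direction $\hat{\mathbf{x}}$ of a light ray in perspective
projection (Figure 2). This is in accordance with the
general law of reflectance.

Since $R$ now depends on $\mathbf{x}$, we can no longer use
the simple shading equations (2); the more general
Hamilton equations (14) have to be used. This is
an advantage of the Hamiltonian formalism and the
generalized treatment at the end of Section 2.

The Hamilton equations can be developed, using the
identity

$$\frac{\partial \mathbf{n}}{\partial \mathbf{P}} = \frac{I}{|\mathbf{P}|} - \frac{\mathbf{P} * \mathbf{P}}{|\mathbf{P}|^3} = \frac{1}{|\mathbf{P}|}\left(I - \mathbf{n} * \mathbf{n}\right)$$

where $I$ is the unit matrix and $*$ denotes the “outer
product” of vectors. Similarly for $\hat{\mathbf{x}}$. We obtain

$$\dot{\mathbf{x}} = \frac{\partial H}{\partial \mathbf{P}} = \frac{\partial R}{\partial(\mathbf{s} \cdot \mathbf{n})} \frac{1}{|\mathbf{P}|} \left(\mathbf{s} - (\mathbf{s} \cdot \mathbf{n})\mathbf{n}\right) + \frac{\partial R}{\partial(\hat{\mathbf{x}} \cdot \mathbf{n})} \frac{1}{|\mathbf{P}|} \left(\hat{\mathbf{x}} - (\hat{\mathbf{x}} \cdot \mathbf{n})\mathbf{n}\right) \tag{18}$$

$$\dot{\mathbf{P}} = -\frac{\partial (R - E)}{\partial \mathbf{x}} = -\frac{\partial R}{\partial(\hat{\mathbf{x}} \cdot \mathbf{n})} \frac{1}{|\mathbf{x}|} \left(\mathbf{n} - (\hat{\mathbf{x}} \cdot \mathbf{n})\hat{\mathbf{x}}\right) + \frac{1}{|\mathbf{x}|} \left(\frac{\partial E}{\partial \hat{\mathbf{x}}} - \left(\frac{\partial E}{\partial \hat{\mathbf{x}}} \cdot \hat{\mathbf{x}}\right)\hat{\mathbf{x}}\right) \tag{19}$$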
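
The projection identity for $\partial \mathbf{n} / \partial \mathbf{P}$ used in this development can be checked numerically. The sketch below is an illustration under assumed values, not part of the original derivation: it compares a central-difference Jacobian of $\mathbf{n} = \mathbf{P}/|\mathbf{P}|$ with the closed form $(I - \mathbf{n} * \mathbf{n})/|\mathbf{P}|$, and confirms that any vector mapped through this Jacobian is perpendicular to $\mathbf{P}$, which is the mechanism behind $\mathbf{P} \cdot \dot{\mathbf{x}} = 0$. The vectors `P` and `v` are arbitrary.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def jacobian_numeric(P, h=1e-6):
    # J[i][j] = d n_i / d P_j by central differences, with n = P / |P|.
    J = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        up = list(P); up[j] += h
        dn = list(P); dn[j] -= h
        n_up, n_dn = normalize(up), normalize(dn)
        for i in range(3):
            J[i][j] = (n_up[i] - n_dn[i]) / (2.0 * h)
    return J

def jacobian_closed_form(P):
    # (I - n * n) / |P|, where * is the outer product.
    norm = math.sqrt(sum(c * c for c in P))
    n = [c / norm for c in P]
    return [[((1.0 if i == j else 0.0) - n[i] * n[j]) / norm
             for j in range(3)] for i in range(3)]

P = [0.3, -1.2, 2.0]
J_num = jacobian_numeric(P)
J_ref = jacobian_closed_form(P)
max_diff = max(abs(J_num[i][j] - J_ref[i][j])
               for i in range(3) for j in range(3))

# Any vector mapped through this Jacobian is tangent: P . (J v) = 0.
v = [0.5, 0.1, -0.7]
Jv = [sum(J_ref[i][j] * v[j] for j in range(3)) for i in range(3)]
p_dot_Jv = sum(P[i] * Jv[i] for i in range(3))
```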

When the reflectance depends on the surface albedo,
then $R$ has an explicit dependence on $\mathbf{x}$:

$$R = R(\mathbf{s} \cdot \mathbf{n},\, \hat{\mathbf{x}} \cdot \mathbf{n},\, \mathbf{x})$$

In this case the first Hamilton equation above is
unchanged. In the second equation, $E$ is replaced by
$E - R$.

4.3  Invariance  properties 

The  fundamental  invariance  property  of  the  above 
equations  is  scale-invariance.  This  holds  because 
both  the  R  and  E  parts  of  the  Hamiltonian  depend 
only  on  angles  and  not  on  distances.  This  is  unlike 
most  physical  laws. 
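
The statement that the Hamiltonian depends only on angles can be tested directly. The following sketch assumes a particular non-Lambertian `R`, a particular `E`, and a fixed light direction `S`, all invented for illustration. It checks that $H$ is unchanged when $\mathbf{x}$ and $\mathbf{P}$ are rescaled independently, and that the Euler relations for a function homogeneous of degree zero hold: $\mathbf{x} \cdot \partial H / \partial \mathbf{x} = 0$ and $\mathbf{P} \cdot \partial H / \partial \mathbf{P} = 0$. Through Hamilton's equations (14) these two relations say that $\mathbf{x} \cdot \dot{\mathbf{P}} = 0$ and $\mathbf{P} \cdot \dot{\mathbf{x}} = 0$.

```python
import math

def unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

S = unit([0.2, 0.3, 1.0])   # assumed light direction s

def R(ns, nx):
    # Assumed non-Lambertian reflectance: depends on both angles.
    return 0.7 * ns + 0.3 * ns * nx * nx

def E(xhat):
    # Assumed brightness as a function of the ray direction only.
    return 0.4 + 0.1 * xhat[0] - 0.2 * xhat[1] * xhat[2]

def H(x, P):
    # H = R(n.s, n.xhat) - E(xhat), built from unit vectors only.
    n, xhat = unit(P), unit(x)
    return R(dot(n, S), dot(n, xhat)) - E(xhat)

def directional_derivative(f, v, direction, h=1e-6):
    # Central-difference derivative of f at v along `direction`.
    up = [vi + h * di for vi, di in zip(v, direction)]
    dn = [vi - h * di for vi, di in zip(v, direction)]
    return (f(up) - f(dn)) / (2.0 * h)

x = [0.5, -0.3, 2.0]
P = [0.1, 0.4, -1.0]

# Rescale x and P by different factors; H should not change.
h_scaled = H([2.5 * c for c in x], [0.4 * c for c in P])
# Euler's relation for a function that is homogeneous of degree zero:
x_dot_Hx = directional_derivative(lambda v: H(v, P), x, x)
P_dot_HP = directional_derivative(lambda v: H(x, v), P, P)
```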

The brightness $E$ is now scale-invariant, so by (9)
we can write

$$\frac{dE}{dw} = x^i \frac{\partial E}{\partial x^i} = 0$$

with $w$ being a scaling parameter. Thus the scale-invariance
equation (10) simplifies to

$$\frac{d}{dt}(\mathbf{x} \cdot \mathbf{P}) = 0$$

I.e., the “scale momentum” $\mathbf{x} \cdot \mathbf{P}$ is conserved along
characteristics. This can also be proved directly
from Hamilton’s equations. The scale-invariance of
this momentum can be seen from the definition of
$\mathbf{P}$ as a gradient $\partial f / \partial \mathbf{x}$ (13). When the coordinates
are scaled by $w$, the momentum is scaled by $1/w$,
keeping the product $\mathbf{x} \cdot \mathbf{P}$ invariant.

In most cases it can be assumed that the characteristics
in some region of the image meet at some
point, such as a point of maximal brightness. (For
our purposes we do not need to know the location of
that point.) Even if two characteristics do not meet,
they can still be assigned the same scale momentum,
because in this case we are free to adjust $\mathbf{P}$ at the
initial points by an arbitrary multiplicative factor.
Thus all characteristics start with the same amount
of scale momentum and conserve the same momentum.
Therefore the scale momentum is conserved
across characteristics as well as along them:

$$\frac{\partial}{\partial s}(\mathbf{x} \cdot \mathbf{P}) = 0 \tag{20}$$

Subtracting (16) from (20) we obtain

$$\frac{\partial \mathbf{P}}{\partial s} \cdot \mathbf{x} = 0 \tag{21}$$

The analogous equation for the t-direction can be
proved in the same way, and also directly by multiplying
(19) scalarly by $\mathbf{x}$.

All the above equations are independent of any rotational
or translational symmetries that the system
may have, and thus the system does not have to possess
these symmetries in order for us to move across
characteristics.

4.4 Moving across a characteristic

Here we develop the equations necessary to move
across a characteristic in our 3D formalism. We deal
with a general surface. In the next section we will
close the system of equations for a locally quadric
surface.

We assume that the values of $\mathbf{x}$, $\mathbf{P}$ are known at
some initial point. We denote by $t$, $s$ the parameters
along and across a characteristic, respectively.
These parameters are defined on the 3D surface, not
in the 2D image projection. We will denote derivatives
wrt the parameters by subscripts, e.g. $\mathbf{x}_t$, $\mathbf{x}_s$.
(The t-derivatives are the same as the dot derivatives
above.)

From the condition that $\mathbf{x}_s$, $\mathbf{x}_t$ are tangent to the
surface (15), (16) we have

$$\mathbf{x}_t \cdot \mathbf{P} = 0 \tag{22}$$

$$\mathbf{x}_s \cdot \mathbf{P} = 0 \tag{23}$$

These two equations are equivalent to $f_s = f_t = 0$,
and ensure the integrability of the system. The last
one is independent of the Hamilton equations and
will enable us to close the system of equations. Next,
we define the arc length along $s$ by

$$\mathbf{x}_s \cdot \mathbf{x}_s = \mathbf{x}_t \cdot \mathbf{x}_t \tag{24}$$

This definition is quite arbitrary, but it is reasonable
to choose $\mathbf{x}_s$ so it will scale the same way as
$\mathbf{x}_t$. This equation holds only along the initial curve
which  is  parametrized  by  s.  On  subsequent  s-curves 
we  will  not  be  free  to  define  the  arclength;  the  Eu¬ 
clidean  distance  between  characteristics  changes  as 
we  go  along  them  even  though  their  distance  in  the 
s  parameter  remains  constant. 

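Another useful dot product is derived by multiplying
(18) by $\mathbf{x}_t$ and using (22):

$$\mathbf{x}_t \cdot \mathbf{x}_t = \mathbf{F} \cdot \mathbf{x}_t \tag{25}$$

where

$$\mathbf{F} = \frac{\partial R}{\partial(\mathbf{s} \cdot \mathbf{n})} \frac{1}{|\mathbf{P}|}\, \mathbf{s} + \frac{\partial R}{\partial(\hat{\mathbf{x}} \cdot \mathbf{n})} \frac{1}{|\mathbf{P}|}\, \hat{\mathbf{x}}$$

Another relation can be easily obtained from (18):

$$\mathbf{s} \cdot \mathbf{x}_t \times \mathbf{P} = \frac{\partial R}{\partial(\hat{\mathbf{x}} \cdot \mathbf{n})}\, \mathbf{s} \cdot \hat{\mathbf{x}} \times \mathbf{n} = G \tag{26}$$

The rhs vanishes for a Lambertian or for a circular
surface.

To find more equations we take derivatives of the
above equations. Differentiating (23) wrt $t$ we obtain

$$\mathbf{x}_{st} \cdot \mathbf{P} + \mathbf{x}_s \cdot \mathbf{P}_t = 0 \tag{27}$$

Next we differentiate (22), (25), (26) wrt $s$ to obtain

$$\mathbf{x}_{st} \cdot \mathbf{P} + \mathbf{x}_t \cdot \mathbf{P}_s = 0 \tag{28}$$

$$2\, \mathbf{x}_t \cdot \mathbf{x}_{st} = \mathbf{F} \cdot \mathbf{x}_{st} + \mathbf{F}_s \cdot \mathbf{x}_t \tag{29}$$

$$\mathbf{s} \cdot \mathbf{x}_{st} \times \mathbf{P} + \mathbf{s} \cdot \mathbf{x}_t \times \mathbf{P}_s = G_s \tag{30}$$

with

$$\mathbf{F}_s = \frac{\partial^2 R}{\partial(\mathbf{s} \cdot \mathbf{n})^2} \frac{\mathbf{s} \cdot \mathbf{n}_s}{|\mathbf{P}|}\, \mathbf{s} + \frac{\partial^2 R}{\partial(\hat{\mathbf{x}} \cdot \mathbf{n})^2} \frac{\hat{\mathbf{x}}_s \cdot \mathbf{n} + \hat{\mathbf{x}} \cdot \mathbf{n}_s}{|\mathbf{P}|}\, \hat{\mathbf{x}} + \frac{\partial R}{\partial(\hat{\mathbf{x}} \cdot \mathbf{n})} \frac{1}{|\mathbf{P}|}\, \hat{\mathbf{x}}_s - \frac{\mathbf{P} \cdot \mathbf{P}_s}{|\mathbf{P}|^2}\, \mathbf{F}$$

$$G_s = \frac{\partial^2 R}{\partial(\hat{\mathbf{x}} \cdot \mathbf{n})^2}\, \mathbf{s} \cdot (\hat{\mathbf{x}} \times \mathbf{n}) \left(\hat{\mathbf{x}}_s \cdot \mathbf{n} + \hat{\mathbf{x}} \cdot \mathbf{n}_s\right) + \frac{\partial R}{\partial(\hat{\mathbf{x}} \cdot \mathbf{n})}\, \mathbf{s} \cdot \left(\hat{\mathbf{x}}_s \times \mathbf{n} + \hat{\mathbf{x}} \times \mathbf{n}_s\right)$$

and

$$\mathbf{n}_s = \frac{\mathbf{P}_s}{|\mathbf{P}|} - \frac{(\mathbf{P} \cdot \mathbf{P}_s)\mathbf{P}}{|\mathbf{P}|^3}, \qquad \hat{\mathbf{x}}_s = \frac{\mathbf{x}_s}{|\mathbf{x}|} - \frac{(\mathbf{x} \cdot \mathbf{x}_s)\mathbf{x}}{|\mathbf{x}|^3} \tag{31}$$

Finally we rewrite (21):

$$\mathbf{P}_s \cdot \mathbf{x} = 0 \tag{32}$$

The above seven equations (23), (24), (27)-(30), (32)
form a system of equations for the nine components
of $\mathbf{x}_s$, $\mathbf{x}_{st}$, $\mathbf{P}_s$. Eqs. (28), (29), (30) are equivalent to
the s-derivative of (18), but they are simpler. The
last five equations have the form of a scalar product,
which is explicitly rotationally invariant.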

To  close  the  system  of  equations  we  need  to  make 
some assumption about the surface. This is done in
the next section.

We  will  mention  for  completeness  the  relation  be¬ 
tween  the  tangency  conditions  (22), (23)  (related  to 
the  scale  invariance)  and  the  irradiance  equation 
(17): 

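$$(\mathbf{x}_t \cdot \mathbf{P})_s - (\mathbf{x}_s \cdot \mathbf{P})_t = \mathbf{x}_t \cdot \mathbf{P}_s - \mathbf{x}_s \cdot \mathbf{P}_t = \frac{dH}{ds} = 0$$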

This  equation  is  independent  of  Hamilton’s  equa¬ 
tions  (which  leave  us  with  an  integration  constant 
which  depends  on  s)  but  it  does  follow  from  the  ir¬ 
radiance  equation. 

4.5  A  locally  quadric  surface 

Here  we  assume  that  the  surface  is  locally  quadric. 
This  means  that  third-order  derivatives  are  assumed 
to  be  relatively  small  at  any  point. 

Quadric  polynomials  were  used  in  [3].  There  it  was 
the  brightness  that  was  approximated  by  quadrics 
rather  than  the  surface  itself  (as  in  our  case),  and  a 
segmentation  of  the  surface  was  needed. 


A quadric surface can be described using a 3 x 3
symmetric matrix $A$ as

$$f = g\left[(\mathbf{x} - \mathbf{x}_0) A (\mathbf{x} - \mathbf{x}_0) + C\right] = 0$$

where $\mathbf{x}_0$ is the center, $C$ is a constant, and $g(\mathbf{x}) \neq 0$
is some arbitrary function. By the definition of the
momentum we have

$$\mathbf{P} = \frac{\partial f}{\partial \mathbf{x}} = 2 g A (\mathbf{x} - \mathbf{x}_0)$$

Multiplying by $\mathbf{x}$, we see that the scale momentum
$\mathbf{x} \cdot \mathbf{P}$ is a function of $g$. The Hamilton equation will
set $g$ so that the scale momentum is constant. For
a central quadric, namely $\mathbf{x}_0 = 0$, it is easily shown
that constant scale momentum yields a constant $g$.

A central quadric is sufficient for our purposes. This is
because, when observed from any direction $z$, a general
quadric surface $z(x, y)$ is indistinguishable from
a central quadric up to second derivatives. I.e., the
derivatives $z_x, z_y, z_{xx}, z_{xy}, z_{yy}$ can be set to any given
values, for any $\mathbf{x}_0$. (This is to be distinguished from
the derivatives of $f$, of which there are nine up to
second  order.)  Thus  having  the  center  at  the  origin 
(the  optical  center)  is  not  a  significant  restriction. 
Of  course,  the  center  does  not  need  to  be  “inside” 
a  convex  object.  For  example,  an  object  can  be  ap¬ 
proximated  locally  as  a  hyperboloid,  whose  center  is 
“outside”  the  body  and  lies  at  the  origin. 

We can thus simply set $g = 1/2$ and obtain

$$\mathbf{P} = A \mathbf{x}$$

We are interested in finding $\mathbf{x}_s$, $\mathbf{P}_s$. Differentiating
wrt $s$ we obtain three simple relations between the
vector components:

$$\mathbf{P}_s = A \mathbf{x}_s \tag{33}$$

However, two of these relations follow from the general
equations in the previous subsection. Multiplying
by $\mathbf{x}$ we get

$$\mathbf{x} \cdot \mathbf{P}_s = \mathbf{x} A \mathbf{x}_s = \mathbf{P} \cdot \mathbf{x}_s$$

and similarly

$$\mathbf{x}_t \cdot \mathbf{P}_s = \mathbf{P}_t \cdot \mathbf{x}_s$$

Both of these equations are general to any surface
and follow from the previous derivations. A third
equation is obtained by multiplying (33) by $\mathbf{x}_{tt}$:

$$\mathbf{x}_{tt} \cdot \mathbf{P}_s = \mathbf{x}_{tt} A \mathbf{x}_s = \mathbf{P}_{tt} \cdot \mathbf{x}_s \tag{34}$$

This does not follow from previous equations and it
characterizes the central quadric. We can see that
this relation contains second-order derivatives of $E$
appearing in $\mathbf{P}_{tt}$, unlike all previous equations.

Another equation can be obtained by differentiating
(33) wrt $t$ and multiplying by $\mathbf{x}_{tt}$:

$$\mathbf{x}_{tt} \cdot \mathbf{P}_{st} = \mathbf{P}_{tt} \cdot \mathbf{x}_{st} \tag{35}$$

Here $\mathbf{P}_{st}$ can be written as a function of $\mathbf{x}_s$, $\mathbf{P}_s$ by
differentiating Hamilton’s equation (19) wrt $s$. Thus
we have two more equations (34), (35) to close the
system we had before.
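
The two quadric relations rest only on the linearity of $\mathbf{P} = A\mathbf{x}$ and the symmetry of $A$. The sketch below is an illustration under assumed data: it draws a random symmetric matrix and random stand-in vectors for $\mathbf{x}_s$, $\mathbf{x}_{tt}$, $\mathbf{x}_{st}$, forms $\mathbf{P}_s = A\mathbf{x}_s$, $\mathbf{P}_{tt} = A\mathbf{x}_{tt}$, $\mathbf{P}_{st} = A\mathbf{x}_{st}$, and checks that $\mathbf{x}_{tt} \cdot \mathbf{P}_s = \mathbf{P}_{tt} \cdot \mathbf{x}_s$ and $\mathbf{x}_{tt} \cdot \mathbf{P}_{st} = \mathbf{P}_{tt} \cdot \mathbf{x}_{st}$. The vectors are not derived from an actual surface; they are hypothetical inputs.

```python
import random

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

rng = random.Random(0)

# Assumed symmetric matrix A of a central quadric, so that P = A x.
upper = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
A = [[upper[min(i, j)][max(i, j)] for j in range(3)] for i in range(3)]

# Stand-ins for derivative vectors of x; any vectors work because the
# relations below use only the linearity and symmetry of A.
x_s = [rng.uniform(-1, 1) for _ in range(3)]
x_tt = [rng.uniform(-1, 1) for _ in range(3)]
x_st = [rng.uniform(-1, 1) for _ in range(3)]

P_s, P_tt, P_st = matvec(A, x_s), matvec(A, x_tt), matvec(A, x_st)

residual_34 = dot(x_tt, P_s) - dot(P_tt, x_s)     # relation (34)
residual_35 = dot(x_tt, P_st) - dot(P_tt, x_st)   # relation (35)
```

If `A` were not symmetric, both residuals would in general be nonzero, which is why these relations carry information specific to the quadric assumption.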

In summary, we have obtained nine linear equations
(23), (24), (27)-(30), (32), (34), (35) for the nine unknowns
$\mathbf{x}_s$, $\mathbf{P}_s$, $\mathbf{x}_{st}$. It should be noted that these
nine  equations  have  to  be  solved  at  every  point  of 
the  initial  curve  only,  not  at  every  point  of  the  sur¬ 
face. 

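It remains to find the “curvatures” $\mathbf{x}_{ss}$. Differentiating
(23), (24) wrt $s$ we get

$$\mathbf{x}_{ss} \cdot \mathbf{P} + \mathbf{x}_s \cdot \mathbf{P}_s = 0$$

$$\mathbf{x}_{ss} \cdot \mathbf{x}_s = \mathbf{x}_t \cdot \mathbf{x}_{st}$$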

It  is  also  easy  to  prove  from  the  general  equations  of 
the  previous  section  that 

Xj  Pj  =  X  •  P„ 

x,  Ps  =  x,,  P 

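$$\mathbf{x}_t \cdot \mathbf{P}_{ss} + \mathbf{x}_{st} \cdot \mathbf{P}_s = \mathbf{x}_s \cdot \mathbf{P}_{st} + \mathbf{x}_{ss} \cdot \mathbf{P}_t$$

These five equations are independent of any particular
assumption such as the quadric assumption.
Since we have six unknowns $\mathbf{x}_{ss}$, $\mathbf{P}_{ss}$, we need an
additional equation. Eq. (33) can be differentiated
with respect to $s$ and multiplied by $\mathbf{x}_{tt}$, which yields
one relation independent of the above:

$$\mathbf{x}_{tt} \cdot \mathbf{P}_{ss} = \mathbf{P}_{tt} \cdot \mathbf{x}_{ss}$$

We thus have six linear equations for $\mathbf{x}_{ss}$, $\mathbf{P}_{ss}$. Given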
these  quantities,  we  can  proceed  along  the  curve  x(s) 
across  the  characteristic  by  steps  of  second-order  ac¬ 
curacy.  The  accumulation  of  steps  produces  first- 
order  accuracy  of  the  curve  at  its  far  end  from  the 
starting  point.  Again,  we  need  to  do  this  only  on 
the  initial  curve,  not  on  the  whole  surface. 

4.6  General  light  sources  and  albedo 

In  the  previous  section  we  assumed  that  the  light 
source  was  rotationally  symmetric  around  s.  This 
was  done  only  for  simplicity.  We  now  describe  the 
modifications  needed  to  remove  this  restriction. 

For  a  general  light  source  our  generalized  reflectance 
function  can  be  written  as 

$$R = \int r(\mathbf{n} \cdot \mathbf{s},\, \mathbf{n} \cdot \hat{\mathbf{x}},\, \mathbf{x})\, d\mathbf{s}$$

with the integration being carried over all the light
directions $\mathbf{s}$, and with $r$ representing the intensity of
the light as a function of the direction $\mathbf{s}$. The argument
$\mathbf{x}$ above (separately from $\mathbf{n} \cdot \hat{\mathbf{x}}$) represents the
dependence of the reflectance on the surface albedo.

Thus,  in  all  the  equations  above  that  involve  R,  R 
will  be  replaced  by  r  and  the  terms  of  the  equations 
will  be  integrated  over  s.  In  the  case  of  variable 
albedo,  Hamilton’s  equation  (19)  is  also  modified, 
as  mentioned  earlier,  by  replacing  E  with  E  —  R  to 
account  for  the  explicit  dependence  of  R  on  x.  Scale- 
invariance  is  preserved  and  all  the  previous  equa¬ 
tions  are  valid. 

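Special attention is needed to handle the equations
with terms that vanish due to rotational symmetry,
namely (26) and its s-derivative, (30). Multiplying
Hamilton’s equation (18) vectorially by $\mathbf{P}$ and integrating
we obtain

$$\dot{\mathbf{x}} \times \mathbf{P} = \left[\int \frac{\partial r}{\partial(\mathbf{s} \cdot \mathbf{n})}\, \mathbf{s}\, d\mathbf{s}\right] \times \mathbf{n} + \hat{\mathbf{x}} \times \mathbf{n} \int \frac{\partial r}{\partial(\hat{\mathbf{x}} \cdot \mathbf{n})}\, d\mathbf{s}$$

In analogy to the symmetric case, we multiply this
equation by a weighted average of the light direction,
$\mathbf{s}^*$. This is defined to be parallel to the vector
integral above:

$$\mathbf{s}^* \times \int \frac{\partial r}{\partial(\mathbf{s} \cdot \mathbf{n})}\, \mathbf{s}\, d\mathbf{s} = 0 \tag{36}$$

This determines the two free components of the unit
vector $\mathbf{s}^*$. We obtain from the previous equation

$$\mathbf{s}^* \cdot \dot{\mathbf{x}} \times \mathbf{P} = G = \mathbf{s}^* \cdot \hat{\mathbf{x}} \times \mathbf{n} \int \frac{\partial r}{\partial(\hat{\mathbf{x}} \cdot \mathbf{n})}\, d\mathbf{s}$$

This replaces (26), and its s-derivative replaces (30):

$$\mathbf{s}^* \cdot \mathbf{x}_{st} \times \mathbf{P} + \mathbf{s}^* \cdot \mathbf{x}_t \times \mathbf{P}_s + \mathbf{s}^*_s \cdot \mathbf{x}_t \times \mathbf{P} = G_s$$

with $\mathbf{s}^*_s$ being calculated from the s-derivative of
(36).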


5  Conclusions 


We  have  presented  a  general  method  of  dealing  with 
physical  and  physics-like  imaging  processes.  The 
formalism  is  based  on  Hamilton’s  equations,  and  has 
several  advantages: 

i)  Our  formalism  is  fully  3D,  unlike  the  usual  treat¬ 
ment  that  separates  the  depth  from  the  rest  of  the 
variables.  This  makes  it  possible  to  use  perspec¬ 
tive  projection  in  cases  that  were  previously  treated 
only  by  less  general  projections,  e.g.  orthographic 
in  shading,  or  paraperspective  in  texture.  To  do 
this we have introduced the “isotropic viewer” representation
of a camera and incorporated it into our
generalized “reflectance function”.

ii) Our formalism makes it possible to find invariants
of the imaging processes. We show how such
invariants  can  be  used  in  the  recovery  of  the  surface. 
The  invariants  result  from  symmetries  of  the  imag¬ 
ing  processes.  The  shading  process  is  the  hardest 
one  in  which  to  find  symmetry,  because  it  generally 
has  only  scale  invariance.  Other  processes  such  as  in¬ 
frared  imaging  have  rotational  or  other  symmetries. 
Texture  can  also  be  treated  like  a  shading  problem 
without  a  light  source  [1].  All  these  cases  can  be 
handled  by  our  formalism  as  special  cases. 

iii)  The  3D  formalism  makes  it  easy  to  incorporate 
model-based  geometric  knowledge  about  the  surface. 
We  showed  as  an  example  the  approximation  of  a 
surface  locally  as  a  quadric.  This  can  replace  the 
reliance  on  singular  points  and  boundaries  with  their 
serious  instability  problems. 

Implementation is feasible because we need only
second-order derivatives of the brightness. Experiments
are under way.

Abstract

We  propose  a  segmentation  method  based 
on  Polya’s  model  for  contagious  phenom¬ 
ena.  An  initial  segmentation  is  obtained 
using  a  Maximum  Likelihood  (ML)  es¬ 
timate  or  the  Nearest  Mean  Classifier 
(NMC).  The  resulting  clusters  are  then 
subjected  to  a  morphological  process  oper¬ 
ating  like  the  development  of  an  infection 
to  yield  segmentation  of  the  image  into  ho¬ 
mogeneous  regions.  This  process  is  imple¬ 
mented  using  contagion  urn  processes  and 
generalizes  Polya’s  scheme  by  allowing  spa¬ 
tial  interactions.  The  urn  mixture  model 
provides  a  fuzzy  representation  of  the  pixel 
label.  The  composition  of  the  urns  is  iter¬ 
atively  updated  by  assuming  a  Markovian 
relationship  between  neighboring  pixel  la¬ 
bels.  The  asymptotic  behavior  of  this  pro¬ 
cess  is  examined.  Examples  of  the  appli¬ 
cation  of  this  scheme  to  the  segmentation 
of  Synthetic  Aperture  Radar  (SAR)  images 
and  Magnetic  Resonance  Images  (MRI)  are 
provided. 

1  Introduction 

We  describe  a  segmentation  method  using  contagion 
urn  schemes  that  rely  on  a  modified  version  of  the 
Polya-Eggenberger sampling process [8]. This biologically
inspired sampling procedure was originally
designed  to  model  the  development  of  contagious 
phenomena. 
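
For readers unfamiliar with contagion urns, the following minimal sketch shows the classical single-urn Polya scheme only, not the spatially interacting version developed in this paper: a ball is drawn with probability proportional to the current counts and is returned together with `delta` extra balls of the same color, so colors that have been drawn become more likely to be drawn again. The urn contents, `delta`, the seed, and the function name `polya_draw` are illustrative assumptions.

```python
import random

def polya_draw(urn, delta, rng):
    # urn maps color -> ball count. Draw one ball with probability
    # proportional to the counts, then return it together with `delta`
    # extra balls of the same color (the "contagion" step).
    total = sum(urn.values())
    pick = rng.random() * total
    running = 0.0
    for color, count in urn.items():
        running += count
        if pick < running:
            break
    urn[color] += delta
    return color

rng = random.Random(1)
urn = {"red": 2, "black": 3}
draws = [polya_draw(urn, delta=1, rng=rng) for _ in range(50)]
```

After the loop, each color's count has grown by exactly the number of times that color was drawn, which is the reinforcement that the segmentation scheme exploits.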

Many approaches to segmentation have been studied.
Unsupervised segmentation approaches include
Nearest Mean Classification (NMC) and the branch-and-bound
procedure [3]. Supervised methods generally
proceed by formulating statistical model assumptions
for the image formation and region generation
processes. Maximum likelihood (ML) or maximum
a posteriori (MAP) estimation is then used for
segmentation. Examples of such approaches abound
in the literature [4; 6; 11]. Techniques modeling images
as Markov random fields (MRFs) have been extensively
investigated [4]. MRFs attempt to represent
spatial dependencies and the MRF-Gibbs equivalence
allows for the computation of the maximum
a posteriori (MAP) estimate of the original image.

This  paper  models  images  using  urn  processes.  The 
motivation  for  employing  urn  schemes  is  twofold: 
First,  urn  processes  can  generate  Markov  chains  as 
well  as  MRFs  [5].  Second,  urn  schemes  are  of  partic¬ 
ular  interest  because  they  provide  a  natural  repre¬ 
sentation  for  fuzzy  image  labeling.  Therefore,  they 
constitute  an  attractive  generative  process  for  the 
underlying  image  regions  which  exhibit  strong  spa¬ 
tial  dependencies.  Our  work  is  related  to  the  Gibbs 
sampling  procedure  [4],  preserving  key  features  of 
the  Gibbs  sampler  but  using  instead  a  contagion 
sampling  scheme.  The  spatial  dependencies  of  the 
pixel  labels  are  captured  by  the  contagious  behavior 
which  promotes  smoothing  of  the  image  into  con¬ 
tiguous  regions.  The  urn  process  for  segmentation 
is related to relaxation labeling algorithms [10], except
that the urn process is not deterministic.

We  begin  by  applying  either  an  ML  or  NMC  seg¬ 
mentation  technique  to  the  image.  The  contagion 
process  is  then  applied  to  the  image  labels.  In  this 
scheme,  each  pixel  is  represented  by  an  urn  with  a 
mixture  of  balls  of  different  colors,  one  color  for  each 
class  label.  A  neighborhood  system  is  also  defined 
on  each  pixel.  The  balls  of  the  urns  of  the  neighbor¬ 
hood  system  are  then  combined  to  determine  the 
next  states  of  the  urns.  The  iterative  nature  of  the 
algorithm  incorporates  temporal  memory,  while  the 
inclusion  of  the  neighboring  urns  in  the  update  pro¬ 
motes  spatial  contagion.  Moreover,  the  neighbor¬ 
hood  system  is  modified,  pending  the  existence  of 
an  edge  element  in  the  neighborhood.  This  is  done 
to  preserve  edges  by  confining  the  propagation  of 
similar  class  labels  within  closed  boundaries. 

This paper is organized as follows: The initial NMC
and  ML  segmentations  are  presented  in  Section  2. 
The  contagion-based  smoothing  process  is  then  de¬ 
scribed  in  Section  3.  In  Section  4,  the  stochastic 
properties  of  the  resulting  image  process  are  dis¬ 
cussed.  Finally,  experimental  results  on  SAR  and 
MR  images  are  shown  in  Section  5. 

2  Initial  Segmentation 

When  no  a  priori  information  on  the  image  statis¬ 
tics  is  available,  general  clustering  algorithms  such 
as  NMC  are  usually  applied.  In  the  NMC  method, 
an  initial  arbitrary  labeling  is  used  from  which  cen¬ 
troids  of  the  feature  vectors  of  each  class  are  com¬ 
puted.  Next,  all  samples  are  reclassified  to  the  clus¬ 
ter  corresponding  to  the  nearest  mean,  and  the  cen¬ 
troids  are  recomputed.  This  process  is  iterated  until 
a  stopping  criterion  is  met  [3]. 
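
The NMC iteration described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: it assumes scalar features, stops when the labels no longer change (the text only says "a stopping criterion"), and re-seeds an empty cluster from a random sample. The function name `nearest_mean_clustering` is hypothetical.

```python
import random


def nearest_mean_clustering(samples, k, max_iter=100, seed=0):
    """Cluster scalar features by iterating nearest-mean reassignment."""
    rng = random.Random(seed)
    # Arbitrary initial labeling, as described in the text.
    labels = [rng.randrange(k) for _ in samples]
    centroids = [0.0] * k
    for _ in range(max_iter):
        # Compute the centroid (mean) of the samples in each class.
        for c in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
            else:
                # Assumption: an empty cluster is re-seeded from a random sample.
                centroids[c] = rng.choice(samples)
        # Reclassify every sample to the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: abs(x - centroids[c])) for x in samples
        ]
        if new_labels == labels:
            # Assumed stopping criterion: the labeling is unchanged.
            break
        labels = new_labels
    return labels, centroids
```

For image data, `samples` would hold one feature value per pixel; vector-valued features would only require replacing the absolute difference with a vector norm.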

In contrast, when a stochastic model for the image can be justified,
it is possible to apply ML segmentation. The conditional distribution
of the image, i.e., the form

$$p(X_s \mid C_s = l,\; C_r;\; r \in N_s) \qquad (1)$$

is assumed. Here, $C_s$ is the label for pixel $s$, $C_r$ represents
the pixel labels of $N_s$, the second-order neighborhood of pixel $s$,
and $X_s$ is the given image data [4].

For segmentation purposes, we estimate the pixel labels by assuming
that the conditional probability of each class label, i.e.
$p(X_s \mid C_s = l)$, is governed by a multivariate Gaussian
distribution on the second-order neighborhood $N_s$.

After  obtaining  the  parameters  of  the  different 
classes,  the  ML  test  determines  the  label  for  each 
pixel  in  the  image.  The  ML  decision  rule  is 

$$\hat{l} = \arg\max_{l}\; p(X_s \mid C_s = l,\; C_r;\; r \in N_s). \qquad (2)$$
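
As a rough illustration of the ML decision rule, the sketch below labels a single observation by maximizing a Gaussian log-likelihood over classes. It is a simplification made for this example only: the feature is a scalar rather than the multivariate neighborhood vector used in the text, the class parameters are assumed to be known, and the names `gaussian_log_likelihood` and `ml_label` are hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Log of the univariate Gaussian density N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ml_label(x, class_params):
    """Return the class index maximizing the likelihood of observation x.

    class_params is a list of (mean, variance) pairs, one per class.
    """
    return max(
        range(len(class_params)),
        key=lambda l: gaussian_log_likelihood(x, *class_params[l]),
    )
```

Because each pixel is labeled independently of its neighbors' final labels, isolated misclassifications are not corrected, which is the behavior the following paragraph criticizes.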

The  above  schemes  do  not  capture  the  statistics  and 
connectedness  of  local  regions.  Since  the  ML  test  as¬ 
sumes  that  each  pixel  label  is  equally  likely  through¬ 
out  the  image,  it  produces  a  noisy  segmentation. 
This  assumption  is  incorrect,  for  in  a  local  region 
dominated  by  one  class,  the  dominant  class  has  a 
higher  prior  probability  than  the  other  classes.  Such 
contextual  information  is  not  taken  into  account  in 
either  the  ML  or  NMC  estimate  of  the  pixel  labels. 

This drawback is usually addressed within the framework of MAP
segmentation. The MAP estimate of the class label $l$ for a pixel
given the observed image $X_s$ is

$$\hat{l} = \arg\max_{l}\; p(C_s = l \mid C_r;\; r \in N_s,\; X_s). \qquad (3)$$

Indeed, it can be shown that maximizing
$p(C_s = l \mid C_r;\; r \in N_s,\; X_s)$ is equivalent to maximizing

$$p(C_s = l \mid C_r;\; r \in N_s)\; p(X_s \mid C_s = l).$$
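
The equivalence can be seen from Bayes' rule. The short derivation below is a sketch that assumes, as the product form implies, that the observation $X_s$ is conditionally independent of the neighboring labels given $C_s$:

```latex
\begin{align*}
p(C_s = l \mid C_r;\, r \in N_s,\, X_s)
  &= \frac{p(X_s \mid C_s = l,\, C_r;\, r \in N_s)\;
           p(C_s = l \mid C_r;\, r \in N_s)}
          {p(X_s \mid C_r;\, r \in N_s)} \\
  &\propto p(X_s \mid C_s = l)\; p(C_s = l \mid C_r;\, r \in N_s).
\end{align*}
```

The denominator does not depend on $l$, so it can be dropped when taking the arg max over $l$.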

If segmentation of the image into homogeneous regions is desired, it
is intuitively appealing to model the prior distribution
$p(C_s = l \mid C_r;\; r \in N_s)$ using an MRF, as the MRF model
relates the label of a pixel to the labels of its neighboring pixels
[6].

If  the  prior  is  modeled  as  an  MRF,  the  Gibbs-MRF 
equivalence  can  be  exploited  by  techniques  such  as 
simulated  annealing  (SA)  or  other  stochastic  relax¬ 
ation  methods  to  derive  the  MAP  estimate  [6]. 
Unfortunately,  techniques  such  as  simulated  anneal¬ 
ing  have  high  computational  costs.  Indeed,  conver¬ 
gence  to  the  MAP  estimate  is  possible  only  when 
impractically  slow  annealing  schedules  are  followed. 
Instead,  we  propose  to  replace  the  annealing  step  by 
an  urn  contagion  process  to  model  the  spatial  de¬ 
pendencies  between  neighboring  pixels. 

3  Image  Sampling  with  Contagion 

The  labeled  image  is  described  by  an  urn  process. 
Each  pixel  in  the  image  is  represented  by  an  urn 
containing  a  mixture  of  balls  of  different  colors  rep¬ 
resenting  the  classes.  The  proportion  of  each  class 
in  the  urn  indicates  the  similarity  of  the  pixel  to  the 
class.  The  urn  representation  is  therefore  a  fuzzy 
representation  of  the  segmented  image.  At  each  it¬ 
eration  the  current  urn  is  modified  by  re-sampling 
with  contagion,  a  process  that  is  inspired  by  the 
original  urn  sampling  process  introduced  by  Polya 
and  Eggenberger  in  [8]. 

3.1  Temporal  Contagion 

The work reported in [8] introduced the following urn scheme as a
model for the spread of a contagious disease through a population. An
urn originally contains $T$ balls, of which $W$ are white and $B$ are
black ($T = W + B$). Successive draws from the urn are made; after
each draw, $1 + \Delta$ ($\Delta > 0$) balls of the same color as was
just drawn are returned to the urn. Let $\rho = W/T$ and
$\delta = \Delta/T$. Define the binary process $\{Z_n\}$ as follows:

$$Z_n = \begin{cases} 0, & \text{if the } n\text{th ball drawn is white;} \\ 1, & \text{if the } n\text{th ball drawn is black.} \end{cases}$$

It  can  be  shown  that  the  process  {Zn}  is  stationary 
and  non-ergodic  [7;  9].  The  urn  scheme  has  infinite 
memory,  in  the  sense  that  each  previously  drawn  ball 
has  an  equal  effect  on  the  outcome  of  the  current 
draw. 
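
A minimal simulation of this urn scheme may make the reinforcement concrete. The function name `polya_draws` and its argument names are illustrative assumptions; the update itself follows the description above, where returning $1 + \Delta$ balls after a draw amounts to a net gain of $\Delta$ balls of the drawn color.

```python
import random


def polya_draws(white, black, delta, n_draws, rng):
    """Simulate n_draws from a Polya urn; return (Z_1..Z_n, white, black).

    Z_k is 0 when the k-th ball drawn is white and 1 when it is black.
    """
    z = []
    for _ in range(n_draws):
        total = white + black
        if rng.random() < white / total:
            z.append(0)
            white += delta  # drawn ball is returned with delta extra white balls
        else:
            z.append(1)
            black += delta  # drawn ball is returned with delta extra black balls
    return z, white, black
```

Running the simulation with different seeds from the same initial composition typically ends at noticeably different final proportions, which is the non-ergodic behavior noted above.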

3.2  Temporal  and  Spatial  Contagion


The  urn  sampling  scheme  proposed  in  this  paper  in¬ 
corporates  both  temporal  and  spatial  contagion.  In¬ 
stead  of  representing  an  image  by  a  finite  lattice  of 
pixels,  we  consider  an  image  as  a  finite  lattice  of 
urns.  In  the  “one-dimensional”  urn  sampling  de¬ 
scribed  above,  the  effect  of  each  sample  propagates 
through  time.  For  the  “two-dimensional”  case,  the 
sampled  ball  at  each  iteration  must  depend  not  only 
on  the  composition  of  the  pixel’s  urn,  but  also  on 
the  compositions  of  the  neighboring  urns  to  encour¬ 
age  contagious  behavior.  Thus,  we  need  to  allow 
for  spatial  interactions  at  each  time  instant  by  as¬ 
sociating  the  urns  of  the  neighboring  pixels  in  the 
determination  of  the  newly  sampled  ball. 

3.3  A  Fuzzy  Image  Labeling 
Representation 

The following presentation considers, without loss of generality, a
binary labeling problem. Let $I_n = [X_n^{(i,j)}]$ be a binary label
image of size $K \times L$, where $X_n^{(i,j)} \in \{0, 1\}$ is the
label of pixel $(i,j)$ at iteration $n$, $n = 0, 1, \ldots$,
$(i,j) \in I$, where $I = \{(i,j) : 1 \le i \le K,\ 1 \le j \le L\}$.

To each pixel $(i,j)$ at time $n$ we associate an urn
$U_n^{(i,j)} : (B_n^{(i,j)}, W_n^{(i,j)})$, where $B_n^{(i,j)}$ and
$W_n^{(i,j)}$ are respectively the number of black and white balls in
the urn. With this representation we define a membership function for
each pixel as

$$\mu_n^{(i,j)} = \frac{B_n^{(i,j)}}{B_n^{(i,j)} + W_n^{(i,j)}}.$$


3.4  An  Algorithm  for  Segmentation 
with  Spatial  Contagion 

The  general  class  of  algorithms  for  the  contagion- 
based  smoothing  process  can  be  described  as  follows: 

•  Initialization 

Let $I_0$ be an initial segmentation (at time index $n = 0$). For each
pixel $(i,j)$ the initial urn composition
$U_0^{(i,j)} : (B_0^{(i,j)}, W_0^{(i,j)})$ is obtained by computing
the relative frequencies of white and black pixels in a spatial
neighborhood centered on $(i,j)$. For this work, the second-order
(3x3) neighborhood system for each pixel is adopted.

•  Iterative  Image  Sampling 

For $n > 0$, the urn composition of each pixel $(i,j)$ is updated by
sampling from a combination of the participating urns
$V_{n-1}^{(i,j)} = \{U_{n-1}^{(r,s)} : (r,s) \in N_{(i,j)}^{k}\}$,
where $N_{(i,j)}^{k}$ is the neighborhood system defined as in [4]:

$$N_{(i,j)}^{k} = \{q = (r,s) \in I : (i - r)^2 + (j - s)^2 < k\}.$$

A simple, yet effective, sampling procedure is as follows: the urn for
pixel $(i,j)$ is updated by first combining the balls of
$U_{n-1}^{(i,j)}$ and the $N$ neighboring urns:

$$C_{n-1}^{(i,j)} = \mathrm{ASSOCIATE}(V_{n-1}^{(i,j)}). \qquad (4)$$

The ASSOCIATE function forms a collection of balls,
$C_{n-1}^{(i,j)}$, from the urns of the neighborhood. Examples of the
ASSOCIATE function include grouping the urns of $V_{n-1}^{(i,j)}$
into a “super” urn or sampling one ball from each urn to form the
collection.
Furthermore,  the  neighborhood  may  be  modified  if 
an  edge  element  exists  in  that  neighborhood;  if  so, 
those  neighboring  urns  which  lie  on  the  other  side  of 
the  edge  are  excluded.  This  is  necessary  to  preserve 
edges  and  limit  contagion  to  local  areas. 

Next, an operation on the new collection of balls,
$C_{n-1}^{(i,j)}$, is performed, i.e.

$$Z_n^{(i,j)} = \mathrm{SELECT}(C_{n-1}^{(i,j)}). \qquad (5)$$

The SELECT function may determine the next state of the urns by
sampling one ball from $C_{n-1}^{(i,j)}$ or by taking the majority
class of $C_{n-1}^{(i,j)}$.

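We denote by $Z_n^{(i,j)}$ the outcome of the SELECT function:

$$Z_n^{(i,j)} = \begin{cases} 0, & \text{if the ball drawn is white;} \\ 1, & \text{if the ball drawn is black.} \end{cases}$$

If $Z_n^{(i,j)} = 0$, add $\Delta$ white balls to urn $U_n^{(i,j)}$;
if $Z_n^{(i,j)} = 1$, add $\Delta$ black balls to urn $U_n^{(i,j)}$.

This yields a new urn composition for each pixel, given by

$$W_n^{(i,j)} = W_{n-1}^{(i,j)} + (1 - Z_n^{(i,j)})\,\Delta, \qquad B_n^{(i,j)} = B_{n-1}^{(i,j)} + Z_n^{(i,j)}\,\Delta.$$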

The above procedure is iterated until $n = N$. At time $N$, the final
composition of each individual urn $U_N^{(i,j)}$, $(i,j) \in I$,
determines the final labeling of
the  image.  As  described  above,  each  urn  represents 
a  fuzzy  membership  function  on  the  pixel  labels. 
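
The initialization and iterative sampling steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: ASSOCIATE draws one ball from each urn in the 3x3 window, SELECT takes the majority, all urns are updated synchronously, ties are resolved in favor of white, and the edge-dependent modification of the neighborhood is omitted. The names `init_urns`, `contagion_step`, and `smooth` are hypothetical.

```python
import random


def init_urns(labels, total=10):
    """Build [black, white] urns from label frequencies in each 3x3 window."""
    K, L = len(labels), len(labels[0])
    urns = []
    for i in range(K):
        row = []
        for j in range(L):
            window = [
                labels[r][s]
                for r in range(max(0, i - 1), min(K, i + 2))
                for s in range(max(0, j - 1), min(L, j + 2))
            ]
            black = round(total * sum(window) / len(window))
            row.append([black, total - black])
        urns.append(row)
    return urns


def contagion_step(urns, delta, rng):
    """One iteration: sample one ball per neighboring urn, then take the majority."""
    K, L = len(urns), len(urns[0])
    new_urns = [[u[:] for u in row] for row in urns]
    for i in range(K):
        for j in range(L):
            votes = []
            for r in range(max(0, i - 1), min(K, i + 2)):
                for s in range(max(0, j - 1), min(L, j + 2)):
                    b, w = urns[r][s]
                    votes.append(1 if rng.random() < b / (b + w) else 0)
            # Assumption: a tie counts as white (z = 0).
            z = 1 if 2 * sum(votes) > len(votes) else 0
            new_urns[i][j][0] += delta * z        # add delta black balls
            new_urns[i][j][1] += delta * (1 - z)  # or delta white balls
    return new_urns


def smooth(labels, n_iter=10, delta=2, total=10, seed=0):
    """Run the contagion process and label each pixel by its urn's majority."""
    rng = random.Random(seed)
    urns = init_urns(labels, total)
    for _ in range(n_iter):
        urns = contagion_step(urns, delta, rng)
    return [[1 if b > w else 0 for b, w in row] for row in urns]
```

An isolated mislabeled pixel starts with an urn dominated by the surrounding class, so repeated majority draws reinforce that class and the speck is absorbed. The ratio `b / (b + w)` of each urn is the fuzzy membership value discussed above.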

4  Statistical  Properties 

4.1  Temporal  Contagion 

The resulting sequence of generated images exhibits both spatial and
temporal dependencies, represented by a Markovian relationship in
terms of the urns. More specifically:

$$\Pr\{U_n^{(i,j)} \mid U_{n-1}, U_{n-2}, \ldots, U_0\} = \Pr\{U_n^{(i,j)} \mid V_{n-1}^{(i,j)}\},$$

where $U_n = [U_n^{(i,j)}]$ is the urn matrix associated with $I_n$,
and $V_{n-1}^{(i,j)}$ is the set of participating urns defined in the
previous section.

Consider the original Polya sampling scheme. The asymptotic
properties of the joint distribution can be characterized in the
“one-dimensional” case, i.e., when all spatial interactions are
inhibited at each sampling step. In this case, it can be shown [9]
that the proportion of white balls in each urn after the $n$th trial,
$\rho_n^{(i,j)}$, where

$$\rho_n^{(i,j)} = \frac{\rho + \delta \sum_{k=1}^{n} \left(1 - Z_k^{(i,j)}\right)}{1 + n\delta},$$

is a martingale [2] and admits a limit $Z$ as the number of draws
increases indefinitely. Indeed, $\rho_n^{(i,j)}$ (or equivalently the
sample average $\frac{1}{n} \sum_{k=1}^{n} (1 - Z_k^{(i,j)})$)
converges with probability 1 to $Z$ [2]. This limiting proportion $Z$
is a continuous random variable with support the interval $(0, 1)$ and
beta probability density function with parameters
$(\rho/\delta, (1 - \rho)/\delta)$:

$$f_Z(z) = \begin{cases} \dfrac{\Gamma(1/\delta)}{\Gamma(\rho/\delta)\,\Gamma((1-\rho)/\delta)}\; z^{\rho/\delta - 1} (1 - z)^{(1-\rho)/\delta - 1}, & \text{if } 0 < z < 1; \\ 0, & \text{otherwise.} \end{cases}$$

$\Gamma(\cdot)$ is the gamma function described by

$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt \quad \text{for } x > 0.$$

The behavior of this pdf can be interpreted as follows: Assuming
$\delta = 1$ for simplicity, if the original fraction of white balls
in the urn is close to 1, then the limiting distribution of
$\rho_n^{(i,j)}$ will be skewed towards 1. A similar behavior is
obtained for the case when $\rho$ is close to 0. Therefore, the
limiting pattern will reflect the underlying probability

$$\Pr(\rho^{(i,j)} = x) = \rho^{x} (1 - \rho)^{(1 - x)}.$$

For the M-ary labeling case, the above observations generalize with
convergence to the Dirichlet distribution [5].

4.2  Temporal  and  Spatial  Contagion

We examine the asymptotic behavior of two examples of a general urn
sampling scheme for segmentation.

•  Method  1

Consider sampling from the “super” urn. Restating the problem,
suppose there are $N$ urns in the neighborhood of pixel $X_s$, each
initially with $b_i$ black balls and $w_i$ white balls, and
$b_i + w_i = T$ for all $i$, $i = 1, 2, \ldots, N$. We put the
contents of all $N$ urns into a “super” urn, sample one ball, and add
$\Delta$ balls of the same color into the urn of pixel $X_s$. The
following properties are easily derived:

The probability of sampling exactly $k$ black balls from $n$
iterations of the “super” urn is

$$\Pr(X = k) = \binom{n}{k} \frac{B(\alpha + k,\; \beta + n - k)}{B(\alpha, \beta)},$$

where $\alpha = \sum_i b_i / \Delta$, $\beta = \sum_i w_i / \Delta$,
and $B(\cdot, \cdot)$ is the beta function

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$

The above process can be regarded as being generated by a sequence of
independent Bernoulli trials with parameter $Z$, where $Z$ is random
with beta distribution. In fact, it is identical with different
parameters to the Polya-Eggenberger distribution in the
“one-dimensional” case given above.

The average number of black balls in the “super” urn at any given
time is expressed as

$$E[B_n] = \sum_i b_i \, \frac{NT + n\Delta}{NT}. \qquad (7)$$

Therefore, the average proportion of black balls in the “super” urn
is

$$E\left[\frac{B_n}{NT + n\Delta}\right] = \frac{\sum_i b_i}{NT}. \qquad (8)$$


Remarkably,  the  average  proportion  of  black  balls  in 
the  “super”  urn  at  any  time  instant  equals  the  origi¬ 
nal  proportion  of  black  balls.  The  above  results  show 
that  the  composition  of  the  urn  is  highly  dependent 
on  the  original  proportion  of  the  balls.  Eventually, 
the  majority  class  of  the  urns  in  a  given  neighbor¬ 
hood  will  spread  and  dominate  the  population  of 
balls  in  that  neighborhood.  Therefore,  we  conclude 
that this urn sampling scheme will reinforce the majority class in a
local spatial neighborhood; it constitutes a positive-feedback system
that yields limiting patterns of the self-reinforcing type [1]. The
contagion effectively models the Markovian dependencies
of  the  pixel  labels. 
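
The claim that the average proportion of black balls stays at its initial value can be checked numerically. The sketch below assumes the beta-binomial (Polya-Eggenberger) form for the number of black draws, with parameters taken here as the initial black and white counts of the “super” urn divided by the number of balls added per draw; the function names are illustrative and not from the paper.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def super_urn_pmf(k, n, blacks, whites, delta):
    """Probability of drawing exactly k black balls in n draws."""
    alpha = sum(blacks) / delta  # assumed parameterization
    beta = sum(whites) / delta
    log_p = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + log_beta(alpha + k, beta + n - k)
        - log_beta(alpha, beta)
    )
    return math.exp(log_p)


def expected_black_fraction(n, blacks, whites, delta):
    """Expected fraction of black balls in the super urn after n draws."""
    total0 = sum(blacks) + sum(whites)
    return sum(
        super_urn_pmf(k, n, blacks, whites, delta)
        * (sum(blacks) + k * delta) / (total0 + n * delta)
        for k in range(n + 1)
    )
```

For two urns holding 2 and 4 black balls out of 10 each, the expected fraction remains 6/20 = 0.3 for any number of draws, even though individual realizations drift towards one class.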

•  Method  2 

This second example is described as follows: We sample one ball from
each of the urns in pixel $X^{(i,j)}$'s neighborhood,
$V_{n-1}^{(i,j)}$. From this collection of balls, we compute the
majority class, denoted by $Z_n^{(i,j)}$. We update urn
$U_n^{(i,j)}$ in the same manner described in the previous section,
i.e.

$$W_n^{(i,j)} = W_{n-1}^{(i,j)} + (1 - Z_n^{(i,j)})\,\Delta, \qquad B_n^{(i,j)} = B_{n-1}^{(i,j)} + Z_n^{(i,j)}\,\Delta.$$

The composition of urn $U_n^{(i,j)}$ is governed by the
Polya-Eggenberger distribution as explained above. Eventually, the
initial majority class of each urn in the neighborhood will dominate
its composition.

It is difficult to find a general closed-form expression
for $P(Z_n^{(i,j)} = k)$, the probability that class $k$ is the
majority  of  the  individual  samples.  The  difficulty 
arises  because  we  are  trying  to  find  the  majority  of 
a  set  of  samples  of  a  non-i.i.d.  process.  Hence,  we 
resort  to  heuristic  arguments.  Experimental  results 
will  be  given  in  the  next  section. 
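
Although no general closed form is available across iterations, the majority probability for a single iteration can be computed exactly once the urn compositions are held fixed, because the draws from different urns are then independent (though not identically distributed). The sketch below does this with a simple dynamic program; the function name and the strict-majority convention are assumptions made for illustration.

```python
def majority_black_probability(black_fractions):
    """P(more than half of the draws are black), one draw per urn.

    black_fractions[i] is the current fraction of black balls in urn i.
    """
    dist = [1.0]  # dist[k] = probability of k black balls so far
    for p in black_fractions:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # this urn yields a white ball
            nxt[k + 1] += q * p       # this urn yields a black ball
        dist = nxt
    n = len(black_fractions)
    return sum(q for k, q in enumerate(dist) if 2 * k > n)
```

For three urns that are each 90% black, the majority is black with probability 0.972, which shows how the majority rule sharpens an already dominant class.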

5  Experimental  Results 

For  segmentation  of  SAR  imagery,  we  start  with  ML 
segmentation.  As  shown  in  Figure  1(a),  the  resulting 
labeling  is  spotty,  a  characteristic  of  the  ML  segmen¬ 
tation  technique.  Application  of  simulated  anneal¬ 
ing  generates  a  contiguous  segmentation  of  the  im¬ 
age  (Figure  1(b)).  Likewise,  Figure  1(c)  shows  that 
ten  iterations  of  the  urn  sampling  scheme  operating 
on  the  ML  segmentation  yield  an  image  labeled  into 
locally  homogeneous  regions.  The  SAR  image  used 
in  this  example  is  from  Lincoln  Laboratory’s  ADTS 
SAR  data,  which  is  fully  polarimetric  with  1  foot 
resolution. 

Whereas  simulated  annealing  achieves  segmentation 
by  optimizing  a  function  (the  MAP  pixel  label  esti¬ 
mate),  modified  urn  schemes  smooth  the  segmented 
image  by  morphologically  processing  the  pixel  la¬ 
bels.  Since  the  Polya  urn  schemes  model  contagious 
behavior  in  a  population,  modified  urn  schemes  al¬ 
low  dominant  pixel  labels  to  propagate  within  local 
regions,  analogous  to  diffusion  methods  for  segmen¬ 
tation.  The  advantage  of  using  the  urn  scheme  lies 
in  the  reduction  of  the  computational  complexity  of 
the  segmentation  algorithm.  We  avoid  the  time  and 
computational  costs  of  simulated  annealing  by  em¬ 
ploying  a  simpler  algorithm. 

To  segment  the  MR  images,  we  obtain  an  initial  seg¬ 
mentation  by  NMC.  The  inherent  noise  of  this  image 
modality  leads  to  a  speckled  segmentation.  The  con¬ 
tagion  urn  process  then  operates  on  the  pixel  labels 
to  produce  a  smoother  segmentation.  The  outputs 
after  one  and  ten  iterations  are  shown  in  Figures 
2(b)  and  2(c),  respectively.  Note  that  the  edges  are 
preserved  by  limiting  contagion  to  local  areas.  An 
edge  map,  computed  by  the  Canny  edge  detector, 
is  employed  to  modify  each  pixel’s  sampling  neigh¬ 
borhood  to  prohibit  sampling  over  different  region 
types. 

In  both  cases,  sampling  method  2  is  implemented; 
one  ball  is  sampled  from  each  of  the  urns  of  the 
neighborhood and a majority rule is applied to determine the next
state of the urns. Each urn is initialized with ten balls, and
$\Delta$, the number of balls added at each iteration, is 2.

6  Conclusion 

In  this  paper,  we  have  illustrated  how  modified  Polya 
urn  sampling  schemes  can  be  implemented  for  image 
segmentation.  Given  an  initial  speckled  segmenta¬ 
tion,  the  contagion  process  obtains  a  smoother  seg¬ 
mentation  into  homogeneous  regions  by  its  Marko¬ 
vian  properties.  Two  general  properties  incorpo¬ 
rate  temporal  and  spatial  contagion.  First,  iterative 
updating  is  required  for  temporal  contagion.  Sec¬ 
ond,  sampling  from  neighboring  urns,  similar  to  the 
Gibbs  sampler,  is  necessary  to  encourage  spatial  con¬ 
tagion. 

Further  lines  of  research  include  the  evaluation  of 
optimal values for the parameter $\delta$, the ratio of $\Delta$ to
the initial number of balls in an urn. If $\delta$ is too high,
the  segmentation  is  over-smoothed;  if  it  is  too  low, 
the  algorithm  may  not  converge  to  the  appropriate 
segmentation.  As  mentioned  above,  the  initial  com¬ 
position  of  the  urns  determines  to  a  great  extent  the 
outcome  of  the  contagion  process.  Therefore,  finding 
an  appropriate  method  of  initializing  the  urn  compo¬ 
sition  is  critical  to  accurately  segmenting  the  image. 

Figure 2: Segmentation of MR images using urn process with inhibition.
(a) Noisy NMC segmentation. (b) After one iteration of urn process.
(c) After ten iterations of urn process.

Abstract


If 3D rigid motion is estimated with some
error, a distorted version of the scene structure
will in turn be computed. Of computational
interest are those regions in space
where  the  distortions  are  such  that  the 
depths  become  negative,  because  in  order 
to  be  visible  the  scene  has  to  lie  in  front  of 
the  image.  The  stability  analysis  for  the 
structure-from-motion  problem  presented 
in  this  paper  investigates  the  optimal  re¬ 
lationship  between  the  errors  in  the  esti¬ 
mated  translational  and  rotational  parame¬ 
ters  of  a  rigid  motion,  that  results  in  the  es¬ 
timation  of  a  minimum  number  of  negative 
depth  values.  The  input  used  is  the  value 
of  the  flow  along  some  direction,  which  is 
more  general  than  optic  flow  or  correspon¬ 
dence.  For  a  planar  retina  it  is  shown  that 
the  optimal  configuration  is  achieved  when 
the  projections  of  the  translational  and  ro¬ 
tational  errors  on  the  image  plane  are  per¬ 
pendicular.  Furthermore,  the  projections 
of  the  actual  and  the  estimated  translation 
lie  on  a  line  passing  through  the  image  cen¬ 
ter.  For  a  spherical  retina,  given  a  rota¬ 
tional  error,  the  optimal  translation  is  the 
correct  one,  while  given  a  translational  er¬ 
ror  the  optimal  rotational  error  is  normal 
to  the  translational  one  at  an  equal  dis¬ 
tance  from  the  real  and  estimated  transla¬ 
tions.  The  proofs,  besides  illuminating  the 
confounding  of  translation  and  rotation  in 
structure  from  motion,  have  an  important 
application  to  ecological  optics,  explaining 
differences  between  planar  and  spherical 
eye  or  camera  designs  in  motion  and  shape 
estimation. 


1  Introduction 

The general problem of structure from motion is defined
as follows: given a number of views of a scene,
to recover the rigid transformations between the
views and the structure (shape) of the scene. In the
field  of  computational  vision  a  lot  of  effort  has  been 
devoted  to  this  problem  because  it  lies  at  the  heart 
of  several  applications  in  pose  estimation,  recogni¬ 
tion,  calibration,  and  navigation  [Faugeras,  1992; 
Hartley,  1994;  Maybank,  1993]. 

While  many  solutions  have  been  proposed,  they  be¬ 
come  problematic  in  the  case  of  realistic  scenes  and 
most  of  them  degrade  ungracefully  as  the  quality  of 
the  input  deteriorates.  This  has  motivated  research 
on  the  stability  of  the  problem;  Daniilidis  and  Spet- 
sakis  [1996]  contains  an  excellent  survey  of  existing 
error  analyses.  In  summary,  it  can  be  concluded 
that  the  majority  of  the  existing  analyses  attempt 
to model the errors in either the 3D motion estimates
or the depth estimates, and due to the large
number  of  unknowns  in  the  problem,  they  deal  with 
restricted  conditions  such  as  planarity  of  the  scene 
or  non-biasedness  of  the  estimators.  Notably  absent 
in  published  efforts  is  an  account  of  the  systematic 
nature  of  the  errors  in  the  depth  estimates  due  to 
errors  in  the  3D  motion  estimates. 

In  this  paper  an  approach  that  is  independent  of 
any  algorithm  or  estimator  is  taken.  Due  to  the  ge¬ 
ometry  of  image  formation  any  spatiotemporal  rep¬ 
resentation  in  the  image  is  due  to  the  3D  motion 
and  the  structure  of  the  scene.  If  the  3D  motion 
can be estimated correctly, the structure can be derived
correctly using the equations of image formation.
However, an error in the estimation of the 3D
motion will result in the computation of a distorted
version of the actual scene structure. Of computational
interest are those regions in space where the
distortions  are  such  that  the  depths  become  nega¬ 
tive.  Not  considering  any  scene  interpretation,  the 
only  fact  we  know  about  the  scene  is  that  for  it  to  be 
visible it has to lie in front of the image and thus the
depth  estimates  have  to  be  positive.  Therefore  the 
number  of  image  points  whose  corresponding  scene 
points  would  yield  negative  values  due  to  erroneous 
3D  motion  estimation  should  be  kept  as  small  as 
possible.  This  is  the  computational  principle  behind 
the  error  analysis  presented  in  this  paper.  In  partic¬ 
ular,  the  following  questions  are  studied.  Assuming 
there  is  an  error  in  the  estimation  of  the  rotational 
motion  components,  what  is  the  error  in  the  trans¬ 
lational  components  that  leads  to  a  minimization 
of  the  number  of  negative  depth  values  computed? 
Similarly,  if  there  is  an  error  in  the  translational  mo¬ 
tion  estimates,  which  rotational  error  will  result  in 
the  smallest  number  of  negative  depth  values?  The 
analysis  is  carried  out  for  a  complete  field  of  view  as 
perceived  by  an  imaging  sphere,  and  for  a  restricted 
field  of  view  on  a  constrained  image  plane. 

2  Overview  and  problem  statement 

2.1  Prerequisites 

We consider an observer moving rigidly with translation
$t = (U, V, W)$ and rotation $\omega = (\alpha, \beta, \gamma)$ in a
stationary environment, and an image formation based on perspective
projection. If the image is formed on a sphere of radius $f$ (i.e.,
$r \cdot r = f^2$), the 2D velocity $\dot{r}$ at an image point
$r = (x, y, z)$ corresponding to a scene point $R = (X, Y, Z)$ is

$$\dot{r} = \frac{v_{tr}(r)}{|R|} + v_{rot}(r) = \frac{1}{|R| f}\,(r \times (r \times t)) - \omega \times r \qquad (1)$$

where $|R|$, being the norm of $R$, denotes the range, and
$v_{tr}(r)/|R|$ and $v_{rot}(r)$ denote the translational and
rotational components respectively.

Similarly, if the image is formed on a plane orthogonal to the $Z$
axis at distance $f$ from the nodal point, the 2D image velocity is

$$\dot{r} = \frac{v_{tr}(r)}{Z} + v_{rot}(r) = -\frac{1}{Z}\,(z_0 \times (t \times r)) + \frac{1}{f}\,(z_0 \times (r \times (\omega \times r))) \qquad (2)$$

where $z_0$ denotes a unit vector in the direction of the $Z$ axis.

The component of the flow $u_n$ along any direction $n$ is therefore

$$u_n = \dot{r} \cdot n = \frac{v_{tr}}{|R|} \cdot n + v_{rot} \cdot n \quad \text{or} \quad \frac{v_{tr}}{Z} \cdot n + v_{rot} \cdot n. \qquad (3)$$

Only the direction of translation $t/|t|$ and the depth (range) of
the scene up to a scaling factor, that is, $Z/|t|$ ($|R|/|t|$), can
be obtained.


2.2  Distorted  space 

Let  us  assume  there  is  an  error  in  the  estimation 
of  the  five  motion  parameters.  As  a  consequence 
there  will  also  be  errors  in  the  estimation  of  depth 
(range)  and  thus  a  distorted  version  of  the  space 
will  be  computed.  A  convenient  way  to  describe  the 
distortion  of  space  is  to  sketch  it  through  surfaces  in 
space  which  are  distorted  by  the  same  multiplicative 
factor,  the  iso-distortion  surfaces. 

In the following, in order to distinguish between the various
estimates, we use letters with hat signs to represent the estimated
quantities ($\hat{t}, \hat{\omega}, |\hat{R}|, \hat{Z}, \hat{v}_{tr}, \hat{v}_{rot}$)
and unmarked letters to represent the actual quantities
($t, \omega, |R|, Z, v_{tr}, v_{rot}$). The subscript “$\epsilon$” is
used to denote errors, where we define
$\hat{\omega} - \omega = \omega_\epsilon$ and
$\hat{v}_{rot} - v_{rot} = v_{rot,\epsilon}$.

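The estimated depth or range can be obtained from (3) as

$$\hat{Z} \ (\text{or } |\hat{R}|) = \frac{\hat{v}_{tr} \cdot n}{\dot{r} \cdot n - \hat{v}_{rot} \cdot n}$$

and thus on the image sphere we get

$$|\hat{R}| = |R| \left[ \frac{(r \times (r \times \hat{t})) \cdot n}{(r \times (r \times t)) \cdot n + f |R|\, (\omega_\epsilon \times r) \cdot n} \right] \qquad (4)$$

From (4) it can be seen that $|\hat{R}|$ can be expressed as a
multiple of $|R|$, where the multiplicative factor, which we denote
by $D$, the distortion factor, is given by the term inside the
brackets. Thus the distortion factor is

$$D = \frac{(r \times (r \times \hat{t})) \cdot n}{(r \times (r \times t)) \cdot n + f |R|\, (\omega_\epsilon \times r) \cdot n} \qquad (5)$$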

Similarly, on the image plane we can interpret the estimated depth as
a multiple of the actual depth with distortion $D$, where

$$D = \frac{-f\,(z_0 \times (\hat{t} \times r)) \cdot n}{-f\,(z_0 \times (t \times r)) \cdot n + Z\,(z_0 \times (r \times (\omega_\epsilon \times r))) \cdot n} \qquad (6)$$

Equations  (5)  and  (6)  describe,  for  any  fixed  di¬ 
rection  n  and  any  distortion  factor  D,  a  surface  in 
space.  Any  such  surface  is  to  be  understood  as  the 
locus  of  points  in  space  which  are  distorted  in  depth 
(range)  by  the  same  factor  D,  if  the  corresponding 
image  measurements  are  in  direction  n. 
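
To make the role of the distortion factor concrete, the sketch below evaluates the spherical expression (numerator built from the estimated translation, denominator from the actual translation plus a term proportional to the range and the rotational error) for a single image point and flow direction. The helper names `cross`, `dot`, and `distortion_sphere` are illustrative assumptions, and the sign conventions simply follow the spherical distortion formula as given in the text.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def distortion_sphere(r, n, t, t_hat, omega_err, scene_range, f=1.0):
    """Distortion factor D at image point r for flow direction n.

    t is the actual translation, t_hat the estimated one, omega_err the
    rotational error, and scene_range the actual range |R|.
    """
    numerator = dot(cross(r, cross(r, t_hat)), n)
    denominator = (
        dot(cross(r, cross(r, t)), n)
        + f * scene_range * dot(cross(omega_err, r), n)
    )
    return numerator / denominator
```

With no motion error the factor is 1. With a rotational error only, the same image point can be distorted by a positive factor at one range and by a negative factor at another, so that moving along a ray crosses the iso-distortion surfaces.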

Figure  1  illustrates  a  family  of  iso-distortion  surfaces 
corresponding  to  the  same  gradient  direction  but  dif¬ 
ferent  distortion  factors  D.  As  can  be  seen  the  iso¬ 
distortion  surfaces  of  a  family  intersect  in  a  curve, 
and  they  change  continuously  as  we  vary  D.  Thus  all 
the space between the 0 distortion surface and the
$-\infty$ distortion surface (which is also the $+\infty$
distortion surface) is distorted by a negative distortion
factor.


Figure 1: Family of iso-distortion surfaces.

2.3  Description  of  results 

In  the  forthcoming  sections  we  employ  a  geometric 
statistical  model  to  represent  the  negative  depth  val¬ 
ues.  We  assume  that  the  scene  in  view  lies  within  a 
certain  depth  (range)  interval.  The  flow  representa¬ 
tion  vectors  in  the  image  are  in  different  directions, 
and  we  assume  some  distribution  for  their  directions. 
Our  focus  is  on  the  points  in  space  which  for  a  3D 
motion  estimate  yield  negative  depth  (range)  esti¬ 
mates. 

For  every  direction  n  the  points  in  space  with  neg¬ 
ative  depth  estimates  cover  the  space  between  the  0 
and $-\infty$ distortion surfaces within the range covered
by  the  scene.  Thus  for  every  direction  we  obtain  a 
3D  subspace,  covering  a  certain  volume.  The  sum 
of  all  volumes  for  all  directions,  normalized  by  the 
flow  distributions  considered,  represents  a  measure 
of  the  likelihood  that  negative  depth  values  occur. 
We  call  it  the  “negative  depth  volume”  or  “negative 
range  volume.”  The  idea  behind  our  error  analy¬ 
sis  lies  in  the  minimization  of  this  negative  depth 
(range)  volume — that  is,  we  are  interested  in  the  re¬ 
lationship  between  the  translational  and  rotational 
motion  errors  that  minimizes  this  volume. 

In  our  analysis  we  assume  that  the  flow  directions 
are  uniformly  distributed  in  every  direction  and 
at  every  depth  (range)  between  a  minimum  value 
$Z_{\min}$ ($|R_{\min}|$) and a maximum value $Z_{\max}$ ($|R_{\max}|$).

In  summary,  as  an  answer  to  the  question  about  the 
coupling  of  motion  errors,  the  following  results  are 
obtained: 

A: If we take the whole sphere as the imaging surface and we assume
an error in the estimation of rotation, then the direction of
translation that minimizes the negative depth volume is the correct
direction of translation.

The practical implication of this result is that 3D motion estimation
is most easily accomplished for a complete field of view, as provided
by an imaging sphere. A working system (biological or artificial) is
usually equipped with an inertial sensor which provides rotational
information, though probably with some error. On the basis of this
information, the best one can do to estimate the remaining
translation is to assume that the flow field obtained by subtracting
the estimated rotation is purely translational and apply a simple
algorithm designed for only translation [Horn and Weldon, Jr., 1988].
Estimation of purely translational motion is much simpler than
estimation of complete 3D rigid motion, which requires techniques
that decouple the translation from the rotation in some way. Thus
insects with spherical eyes, such as bees and flies, have a big
advantage in the task of 3D motion estimation.

B: On the other hand, if we assume a certain error in the estimation
of translation on a spherical image, we find that the vector of the
rotational error lies on the same geodesic as the real translation
$t$ and the estimated translation $\hat{t}$ at equal distance from
both, that is, $(t + \hat{t}) \times \omega_\epsilon = 0$ ($t$ and
$\hat{t}$ are unit vectors).

C: Considering as imaging surface a plane of limited extent, we find
that the translational and rotational errors are perpendicular to
each other. Using the notation $(x_0, y_0)$ and
$(\hat{x}_0, \hat{y}_0)$ for the image-plane projections of the
actual and the estimated translation, with
$\hat{x}_0 - x_0 = x_{0\epsilon}$ and
$\hat{y}_0 - y_0 = y_{0\epsilon}$, this means that
$x_{0\epsilon}/y_{0\epsilon} = -\beta_\epsilon/\alpha_\epsilon$. If we
fix the rotational error
$(\alpha_\epsilon, \beta_\epsilon, \gamma_\epsilon)$, this provides
us with a constraint on the direction of the translational error.

D: If we fix the translational error
$(x_{0\epsilon}, y_{0\epsilon})$ we obtain the same constraint, and
in addition we find that $\gamma_\epsilon = 0$. Furthermore, if we
only fix the amount of translational error we find that
$x_0/y_0 = \hat{x}_0/\hat{y}_0$.

These results are in accordance with experimental observations and
with proofs derived under particular simplifying assumptions.

The importance of the results obtained for the plane also lies in
their consequences for shape estimation. They can be translated into
the statement that planar retinas with high resolution at the center
are advantageous in the computation of shape. As will be shown in
Section 5, if
$x_{0\epsilon}/y_{0\epsilon} = -\beta_\epsilon/\alpha_\epsilon$ and
$x_0/y_0 = \hat{x}_0/\hat{y}_0$, then near the fixation center, for
any depth $Z$, the distortion factor is approximately the same for
every flow direction. This means that all scene points of the same
depth are distorted by the same factor, and thus a depth map is
derived whose level contours are the correct ones.

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Table  1: 


3  Analysis  on  the  sphere 

3.1  Fixed  rotational  error 

As a parameterization for expressing the orientations
$n$ we choose the following:

As shown in Figure 2a, let $\omega_\varepsilon$ be parallel to the $x$ axis
and let $s$ be the set of all the unit vectors in the $yz$
plane with $s = (0, \sin\chi, \cos\chi)$ and $\chi$ in the interval
$[0 \ldots \pi]$. The flow directions $n$ at every point are
defined as $n = \frac{s \times r}{\|s \times r\|}$. In this parameterization, $n$ takes
on every possible orientation in the tangent plane
at every point, but not all orientations are treated
equally. In order to obtain a uniform distribution
we must perform some normalization.


Figure 2: (a) Parameterization used in the analysis:
$\omega_\varepsilon = \lambda(1,0,0)$, $s = (0, \sin\chi, \cos\chi)$ with $\chi \in [0 \ldots \pi]$,
$n = \frac{s \times r}{\|s \times r\|}$. (b) Parameterization of $r$ for normalization:
$\varphi_y$ is the angle between $r$ and the $yz$ plane; $\varphi_x$
is the angle between the projection of $r$ on the $yz$
plane and some fiducial direction in the $yz$ plane.


As shown in [Fermüller and Aloimonos, 1997], this
can be achieved by multiplying the volume $V(\chi)$ for
every direction $\chi$ by


_ sm  (py _ 

cos(¥!y)2cos(x-¥>x)^- 


where $\varphi_x$ and $\varphi_y$ are defined as described in Figure 2b.

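Our focus is on the points in space with estimated
negative range values $|\hat{R}|$. Since $n = \frac{s \times r}{\|s \times r\|}$ and
$s \cdot \omega_\varepsilon = 0$, we obtain from (4), by setting $f = 1$,

$$\frac{(\hat{t} \times s) \cdot r}{(t \times s) \cdot r - |R| (\omega_\varepsilon \cdot r)(s \cdot r)} < 0 \qquad (7)$$

From this inequality the following constraint on $|R|$
can be derived:

$$\mathrm{sgn}\big((\hat{t} \times s) \cdot r\big) = -\mathrm{sgn}\big((t \times s) \cdot r - |R| (\omega_\varepsilon \cdot r)(s \cdot r)\big) \qquad (8)$$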

At  any  point  r  in  the  image  this  constraint  is  ei¬ 
ther  satisfied  for  all  values  |R|,  or  it  is  satisfied  for 
an  interval  of  values  |R|  bounded  from  either  above 
or  below,  or  it  is  not  satisfied  for  any  value  at  all. 
Thus,  (7)  provides  a  classification  of  the  points  on 
the  sphere,  and  we  obtain  four  different  kinds  of  ar¬ 
eas  (types  I-IV),  as  summarized  in  Table  1. 


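Table 1: Classification of the points $r$ on the sphere into areas of types I-IV and the resulting constraint on $|R|$.

area | location                                                                                                                              | constraint on $|R|$
I    | $\mathrm{sgn}((t \times s) \cdot r) = \mathrm{sgn}((\hat{t} \times s) \cdot r) = \mathrm{sgn}((r \cdot \omega_\varepsilon)(r \cdot s))$   | $|R| > \frac{(t \times s) \cdot r}{(r \cdot \omega_\varepsilon)(r \cdot s)}$
II   | $-\mathrm{sgn}((t \times s) \cdot r) = \mathrm{sgn}((\hat{t} \times s) \cdot r) = \mathrm{sgn}((r \cdot \omega_\varepsilon)(r \cdot s))$  | all $|R|$
III  | $\mathrm{sgn}((t \times s) \cdot r) = -\mathrm{sgn}((\hat{t} \times s) \cdot r) = \mathrm{sgn}((r \cdot \omega_\varepsilon)(r \cdot s))$  | $|R| < \frac{(t \times s) \cdot r}{(r \cdot \omega_\varepsilon)(r \cdot s)}$
IV   | $\mathrm{sgn}((t \times s) \cdot r) = \mathrm{sgn}((\hat{t} \times s) \cdot r) = -\mathrm{sgn}((r \cdot \omega_\varepsilon)(r \cdot s))$  | none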
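
The classification of Table 1 follows mechanically from the sign condition (8). The following is a minimal sketch in plain Python, not code from the paper: the function names `classify_point` and `is_negative_range` are hypothetical, and vectors are plain 3-tuples. It assigns a point $r$ to one of the areas I-IV and returns the bound $\frac{(t \times s) \cdot r}{(r \cdot \omega_\varepsilon)(r \cdot s)}$ where one applies, so that the result can be cross-checked against inequality (7).

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def sign(x):
    """Return -1, 0 or +1."""
    return (x > 0) - (x < 0)


def classify_point(r, s, t, t_hat, w_eps):
    """Return (area, bound) for image point r, following Table 1.

    area is "I", "II", "III" or "IV" ("boundary" if a term vanishes).
    bound is the value of |R| that limits the negative-range interval:
    |R| > bound in area I, |R| < bound in area III, otherwise None.
    """
    a = dot(cross(t, s), r)           # (t x s) . r, actual translation
    a_hat = dot(cross(t_hat, s), r)   # (t_hat x s) . r, estimated translation
    b = dot(w_eps, r) * dot(s, r)     # (w_eps . r)(s . r), rotational error term
    if a == 0 or a_hat == 0 or b == 0:
        return "boundary", None
    sa, sh, sb = sign(a), sign(a_hat), sign(b)
    if sa == sh == sb:
        return "I", a / b
    if -sa == sh == sb:
        return "II", None
    if sa == -sh == sb:
        return "III", a / b
    return "IV", None


def is_negative_range(R, r, s, t, t_hat, w_eps):
    """Direct evaluation of inequality (7) for a given range value R."""
    a = dot(cross(t, s), r)
    a_hat = dot(cross(t_hat, s), r)
    b = dot(w_eps, r) * dot(s, r)
    return a_hat * (a - R * b) < 0
```

For a point in area I, `is_negative_range` is true exactly for range values above the returned bound; in area III, exactly for values below it.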

Thus for any $s$, we obtain a volume of negative range
values consisting of the volumes above areas I, II,
and III. An illustration for both hemispheres is given
in Figure 3. As can be seen, areas II and III cover
the same amount of area, which has the size of the
area between the two great circles $(t \times s) \cdot r = 0$ and
$(\hat{t} \times s) \cdot r = 0$, and area I covers a hemisphere minus
the area between $(t \times s) \cdot r = 0$ and $(\hat{t} \times s) \cdot r = 0$.
If the scene in view is unbounded, that is, $|R| \in [0 \ldots \infty]$,
there is a range of values $|R|$ above any
point $r$ in areas I and III which results in negative
range estimates. If we consider a lower bound
$|R_{min}| \neq 0$ and an upper bound $|R_{max}| \neq \infty$, we
obtain two additional curves $C_{min}$ and $C_{max}$ with
$C_{min} = (t \times s) \cdot r - |R_{min}| (\omega_\varepsilon \cdot r)(s \cdot r) = 0$ and
$C_{max} = (t \times s) \cdot r - |R_{max}| (\omega_\varepsilon \cdot r)(s \cdot r) = 0$ as bounds
for areas with negative range values (as shown in
Figure 3). As can be seen, the curves $C_{min} = 0$,
$C_{max} = 0$, $(t \times s) \cdot r = 0$ and $(\omega_\varepsilon \cdot r)(s \cdot r) = 0$
intersect at the same point.


Figure 3: Classification of image points according
to constraints on $|R|$. At $C_{min}$ ($C_{max}$), $|R|$ is
constrained to be greater (area I) or smaller (area
III) than $|R_{min}|$ or $|R_{max}|$. The two hemispheres
correspond to the front of the sphere and the back
of the sphere, both as seen from the front of the
sphere.

In area I, we do not obtain any volume of negative
range estimates for points $r$ between the curves
$(\omega_\varepsilon \cdot r)(s \cdot r) = 0$ and $C_{max} = 0$; the volume for
points $r$ between $C_{min} = 0$ and $C_{max} = 0$ is bounded
from below by $|R| = \frac{(t \times s) \cdot r}{(\omega_\varepsilon \cdot r)(s \cdot r)}$ (and from above
by $|R_{max}|$), and the volume for points $r$ between
$C_{min} = 0$ and $(t \times s) \cdot r = 0$ extends from $|R_{min}|$ to
$|R_{max}|$. In area III we do not obtain any volume for
points $r$ between $(t \times s) \cdot r = 0$ and $C_{min} = 0$. The
volume for points $r$ between $C_{min} = 0$ and $C_{max} = 0$
is bounded from above by $|R| = \frac{(t \times s) \cdot r}{(\omega_\varepsilon \cdot r)(s \cdot r)}$ (and from
below by $|R_{min}|$), and the volume for points $r$ between
$C_{max} = 0$ and $(\omega_\varepsilon \cdot r)(s \cdot r) = 0$ extends from
$|R_{min}|$ to $|R_{max}|$.
We are given $\omega_\varepsilon$ and $t$, and we are interested in $\hat{t}$,
which minimizes the negative range volume. For any
$s$ the negative range volume becomes smallest if $\hat{t}$ is
on the great circle of $t$ and $s$, that is, $(t \times s) \cdot \hat{t} = 0$,
as will be shown next.

Let us consider a $\hat{t}$ such that $(t \times s) \cdot \hat{t} \neq 0$, and let
us change $\hat{t}$ so that $(t \times s) \cdot \hat{t} = 0$. As $\hat{t}$ changes,
the area of type II on the sphere becomes an area
of type IV and the area of type III becomes an area
of type I. Thus, the negative range volume obtained
consists only of range values above areas of type I.

Let us use the following notation. $A_{III-I}$ denotes
the area which changes from type III to type I and
$V_{III}$ and $V_{I(III)}$ are the volumes before and after the
change. Similarly, $A_{II-IV}$ denotes the area which
changes from type II to type IV and $V_{II}$ and $V_{IV}$
are the corresponding volumes.

The change of $\hat{t}$ does not have any effect on the volumes
above the areas that did not change in type.
However, the change of $\hat{t}$ causes a decrease in the volume
above the areas which changed in type: volume
$V_{I(III)} < V_{II}$. Furthermore, the normalization term
is the same for points $r_1(\varphi_{x_1}, \varphi_{y_1})$ and $r_2(\varphi_{x_2}, \varphi_{y_2})$
symmetric with respect to the great circle $s \cdot r = 0$,
because $\varphi_{y_1} = \varphi_{y_2}$ and [...] with $k \in \mathbb{N}$.

Thus we encounter the same normalization factors
in areas $A_{III-I}$ and $A_{II-IV}$.

The volume of negative range values for any $s$ is
smallest for $(t \times s) \cdot \hat{t} = 0$, independent of the range of
values in which the scene lies. If we assume an upper
bound $|R_{max}| \neq \infty$, or a lower bound $|R_{min}| \neq 0$,
or both bounds, there exist points $r$ in areas I and
III above which there are no range values which contribute
to the negative range volume. However, $V_{II}$
is always larger than $V_{I(III)}$.

For any $s$ the smallest volume is obtained for $s$, $t$,
and $\hat{t}$ lying on a great circle. Therefore, in order to
minimize the total negative range volume, we must
have $t = \hat{t}$.

Thus, in summary, we have shown that for any given
rotational error $\omega_\varepsilon$ the negative range volume is
smallest if the direction of the actual translation and
the estimated translation coincide, that is, $t = \hat{t}$.
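
The result of this subsection can be illustrated numerically. The sketch below is a rough Monte Carlo check and not the authors' procedure: it draws image points $r$ uniformly on the sphere, flow directions $n$ uniformly in the tangent plane, and range values uniformly in $[|R_{min}|, |R_{max}|]$, and reports the fraction of samples with negative estimated range as a simple proxy for the negative range volume. The sign condition used is the form of (7) for a general tangent direction $n$; the sign convention for $\omega_\varepsilon$ and all function names are assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)


def random_unit(rng):
    """Uniformly distributed direction on the unit sphere."""
    return unit((rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)))


def random_tangent(rng, r):
    """Uniformly distributed unit vector in the tangent plane at r."""
    v = random_unit(rng)
    d = dot(v, r)
    return unit(tuple(vi - d * ri for vi, ri in zip(v, r)))


def negative_fraction(t, t_hat, w_eps, r_min, r_max, samples=20000, seed=0):
    """Fraction of sampled (r, n, |R|) triples with negative estimated range."""
    rng = random.Random(seed)
    negative = 0
    for _ in range(samples):
        r = random_unit(rng)
        n = random_tangent(rng, r)
        R = rng.uniform(r_min, r_max)
        # Estimated range is negative when the estimated translational flow
        # along n and the remaining (translation + rotational error) flow
        # along n have opposite signs.
        if dot(t_hat, n) * (dot(t, n) + R * dot(cross(w_eps, r), n)) < 0:
            negative += 1
    return negative / samples
```

With, for example, `t = (0, 0, 1)`, a rotational error `w_eps = (0.05, 0, 0)` and ranges in `[1, 5]`, calling `negative_fraction(t, t, ...)` and then `negative_fraction(t, t_off, ...)` for a `t_off` rotated away from `t` lets one compare the two fractions directly.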

3.2  Fixed  translational  error 

The analysis investigating the smallest rotational error,
given a translational error, can be carried out
in a way similar to the one above. For reasons of
brevity only the idea is outlined here.

We are given $t$ and $\hat{t}$, and we are interested in the
direction of $\omega_\varepsilon$ minimizing the negative range volume.
We choose the unit vectors $t$, $\hat{t}$, and $s$ to lie in the
$xz$ plane.

We decompose $\omega_\varepsilon$ into a component $\omega_{par}$ which lies
in the $xz$ plane and a component $\omega_{perp}$ parallel to
the $y$ axis: $\omega_\varepsilon = \omega_{par} + \omega_{perp}$. It can be shown
that if $\omega_{perp} = 0$, the smallest negative range volume
is obtained for $\omega_{par_0} \neq 0$ with
$(t \times \omega_{par_0}) = -(\hat{t} \times \omega_{par_0})$ (both cross products are
parallel to the $y$ axis). A general $\omega_\varepsilon$
must satisfy the constraint $(\omega_\varepsilon \cdot t) = (\omega_\varepsilon \cdot \hat{t})$, but
if we change the direction of $\omega_\varepsilon$, which amounts to
adding a perpendicular component proportional to
$\lambda(0, -1, 0)$, by continuously increasing $\lambda > 0$, the
negative depth volume increases monotonically. Thus
the smallest negative depth volume is obtained for
$\omega_{par_0}$, which lies on the geodesic through $t$ and $\hat{t}$
at an equal distance from both.
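
The geometric statement above (the optimal rotational error lies on the geodesic through $t$ and $\hat{t}$, at equal distance from both) fixes only the direction of the error. A minimal sketch, assuming unit vectors stored as 3-tuples and a hypothetical helper name `bisector_direction`, computes that direction as the normalized sum of the two translation vectors and verifies the condition $(t + \hat{t}) \times \omega_\varepsilon = 0$ stated in result B:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)


def bisector_direction(t, t_hat):
    """Unit vector on the great circle through t and t_hat, midway between them.

    Undefined when t_hat == -t (the sum vanishes); that case is not handled.
    """
    return unit(tuple(a + b for a, b in zip(t, t_hat)))


t = (0.0, 0.0, 1.0)
t_hat = unit((1.0, 0.0, 1.0))
w_dir = bisector_direction(t, t_hat)

# (t + t_hat) x w_dir vanishes, and w_dir makes equal angles with t and t_hat.
residual = cross(tuple(a + b for a, b in zip(t, t_hat)), w_dir)
equal_angles = abs(dot(w_dir, t) - dot(w_dir, t_hat)) < 1e-12
```

The magnitude of the rotational error is not determined by this construction; only its axis is.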

4  The  planar  case 

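If we use the more common component notation and
express the coordinates of the focus of expansion as
$(x_0, y_0) = \left(\frac{U}{W} f, \frac{V}{W} f\right)$, and we set $W = 1$ and $\hat{W} = 1$,
we obtain for the distortion factor

$$D = \frac{(x - \hat{x}_0) n_x + (y - \hat{y}_0) n_y}{(x - x_0) n_x + (y - y_0) n_y + Z \left( \left( \alpha_\varepsilon \frac{xy}{f} - \beta_\varepsilon \left( \frac{x^2}{f} + f \right) + \gamma_\varepsilon y \right) n_x + \left( \alpha_\varepsilon \left( \frac{y^2}{f} + f \right) - \beta_\varepsilon \frac{xy}{f} - \gamma_\varepsilon x \right) n_y \right)} \qquad (9)$$

where $n_x$ and $n_y$ denote the components of $n$ in the
$x$ and $y$ directions.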

In  the  following  analysis,  we  perform  some  sim¬ 
plification:  For  a  limited  field  of  view,  the  terms 
quadratic  in  the  image  coordinates,  which  appear  in 
the  rotational  components,  are  small  with  respect  to 
the  linear  and  constant  terms,  and  we  therefore  drop 
them. 

The flow directions $(n_x, n_y)$ can alternatively be
written as $(\cos\psi, \sin\psi)$, with $\psi \in [0, \pi]$ denoting
the angle between $[n_x, n_y]^T$ and the $x$ axis.

To simplify the visualization of the volumes of negative
depth in different directions, we perform the
following coordinate transformation:

$[x', y']^T = R[x, y]^T$, $[x_0', y_0']^T = R[x_0, y_0]^T$,
$[\hat{x}_0', \hat{y}_0']^T = R[\hat{x}_0, \hat{y}_0]^T$, $[\alpha_\varepsilon', \beta_\varepsilon']^T = R[\alpha_\varepsilon, \beta_\varepsilon]^T$

where $R = \begin{pmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{pmatrix}$

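The 0 distortion surface and the $-\infty$ distortion surface
thus become

$(x' - \hat{x}_0') = 0$
and $(x' - x_0') + Z(-\beta_\varepsilon' f + \gamma_\varepsilon y') = 0$

In the following proof we first consider the case of
$\gamma_\varepsilon = 0$ and then summarize the general case.

Part 1 ($\gamma_\varepsilon = 0$)

If $\gamma_\varepsilon = 0$, the volume of negative depth values for
every direction $\psi$ lies between the surfaces

$(x' - \hat{x}_0') = 0$ and $(x' - x_0') - \beta_\varepsilon' f Z = 0$

The equation $(x' - \hat{x}_0') = 0$ describes a plane parallel
to the $y'Z$ plane at distance $\hat{x}_0'$ from the origin,
and the equation $(x' - x_0') - \beta_\varepsilon' f Z = 0$ describes a
plane parallel to the $y'$ axis of slope $\frac{1}{\beta_\varepsilon' f}$ which intersects
the $x'y'$ plane at the $x'$ coordinate $x_0'$. Thus
we obtain a wedge-shaped volume parallel to the $y'$
axis. Figure 4 illustrates the volume through a slice
parallel to the $x'Z$ plane.

Figure 4: Slice parallel to the $x'Z$ plane through
the volume of negative estimated depth for a single
direction.

Let us denote the area of this cross section by $A_\psi$. If
$\hat{x}_0'$ lies between $x_0' + \beta_\varepsilon' f Z_{min}$ and $x_0' + \beta_\varepsilon' f Z_{max}$,

$$A_\psi = x_{0\varepsilon}' (Z_{max} + Z_{min}) + \frac{\beta_\varepsilon' f}{2} \left( Z_{max}^2 + Z_{min}^2 \right) + \frac{x_{0\varepsilon}'^2}{\beta_\varepsilon' f} \qquad (10)$$

If we fix $\beta_\varepsilon'$ and solve $\frac{\partial A_\psi}{\partial x_{0\varepsilon}'} = 0$, we obtain
$x_{0\varepsilon}' = -\frac{\beta_\varepsilon' f}{2} (Z_{max} + Z_{min})$, that is, the 0 distortion surface
has to intersect the $-\infty$ distortion surface in
the middle of the depth interval in the plane
$Z = \frac{Z_{max} + Z_{min}}{2}$.

Since $\beta_\varepsilon' = \cos\psi \, \beta_\varepsilon - \sin\psi \, \alpha_\varepsilon$ and
$x_{0\varepsilon}' = \cos\psi \, x_{0\varepsilon} + \sin\psi \, y_{0\varepsilon}$, the volume is minimized for every direction
if $\frac{x_{0\varepsilon}}{y_{0\varepsilon}} = -\frac{\beta_\varepsilon}{\alpha_\varepsilon}$. In other words, the rotational
error $(\alpha_\varepsilon, \beta_\varepsilon)$ and the translational error $(x_{0\varepsilon}, y_{0\varepsilon})$
have to be perpendicular to each other. If, on
the other hand, we fix $x_{0\varepsilon}'$, we obtain
$x_{0\varepsilon}' = -\beta_\varepsilon' f \sqrt{\frac{Z_{max}^2 + Z_{min}^2}{2}}$
and again $\frac{x_{0\varepsilon}}{y_{0\varepsilon}} = -\frac{\beta_\varepsilon}{\alpha_\varepsilon}$.

Part 2 ($\gamma_\varepsilon \neq 0$)

If $\gamma_\varepsilon \neq 0$, the $-\infty$ distortion surface becomes

$(x' - x_0') + Z(-\beta_\varepsilon' f + \gamma_\varepsilon y') = 0$

This surface can be most easily understood by slicing
it with planes parallel to the $x'y'$ plane. At every
depth value $Z$, we obtain a line of slope $-\frac{1}{\gamma_\varepsilon Z}$ which
intersects the $x'$ axis in $x' = x_0' + \beta_\varepsilon' f Z$ (see Figure 5).

Figure 5: Slices parallel to the $x'y'$ plane through
the 0 distortion surface ($C_0$) and the $-\infty$ distortion
surfaces at depth values $Z = Z_{min}$ ($C_1$), $Z = \frac{Z_{max} + Z_{min}}{2}$
($C_2$), and $Z = Z_{max}$ ($C_3$).

In order to study the smallest negative depth volume
for a given rotational error, we study how the volume
changes as $\hat{x}_0'$ changes from $x_0' + \beta_\varepsilon' f \frac{Z_{max} + Z_{min}}{2}$ to
$x_0' + \beta_\varepsilon' f \frac{Z_{max} + Z_{min}}{2} + d$. As derived in [Fermüller
and Aloimonos, 1997], we obtain for the smallest
negative depth volume

$d = \beta_\varepsilon' f \left( \ldots - \frac{1}{2} (Z_{max} + Z_{min}) \right)$

and thus depends only on the depth interval.

Therefore we have the constraint $\frac{x_{0\varepsilon}}{y_{0\varepsilon}} = -\frac{\beta_\varepsilon}{\alpha_\varepsilon}$. For
a given rotational error $(\alpha_\varepsilon, \beta_\varepsilon, \gamma_\varepsilon)$, this constraint
defines the direction of the FOE of the translational
error on the image plane. For a given translational
error $(x_{0\varepsilon}, y_{0\varepsilon})$ this constraint defines the direction
of the AOR of the rotational error on the image. In
addition we must have $\gamma_\varepsilon = 0$.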
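
The Part 1 argument ($\gamma_\varepsilon = 0$) can be checked numerically. In the sketch below, which is an illustration under stated assumptions rather than the authors' computation, the cross-section of the wedge in the $x'Z$ plane is taken to be the region between the line $x' = \hat{x}_0'$ and the line $x' = x_0' + \beta_\varepsilon' f Z$ for $Z \in [Z_{min}, Z_{max}]$, so its width at depth $Z$ is $|x_{0\varepsilon}' + \beta_\varepsilon' f Z|$ with $x_{0\varepsilon}' = x_0' - \hat{x}_0'$. The closed form assumes $\beta_\varepsilon' f > 0$ and that the two lines cross inside the depth interval. Function names are hypothetical.

```python
def wedge_area_closed_form(x0e, beta, f, z_min, z_max):
    """Cross-section area as a quadratic in x0e (valid for beta * f > 0 and a
    crossing depth -x0e / (beta * f) inside [z_min, z_max])."""
    return (x0e * (z_max + z_min)
            + 0.5 * beta * f * (z_max ** 2 + z_min ** 2)
            + x0e ** 2 / (beta * f))


def wedge_area_numeric(x0e, beta, f, z_min, z_max, steps=20000):
    """Midpoint-rule integral of the wedge width |x0e + beta * f * Z| over Z."""
    dz = (z_max - z_min) / steps
    total = 0.0
    for i in range(steps):
        z = z_min + (i + 0.5) * dz
        total += abs(x0e + beta * f * z)
    return total * dz


def best_x0e(beta, f, z_min, z_max):
    """Translational error component that minimizes the area for fixed beta:
    the two lines then cross at the middle of the depth interval."""
    return -0.5 * beta * f * (z_max + z_min)
```

For `beta = 0.01`, `f = 1`, `z_min = 2`, `z_max = 6`, the minimizer is `x0e = -0.04`, and the two lines cross at `Z = 4`, the middle of the depth interval.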

Next let us fix only the amount of translational error
$(x_{0\varepsilon}^2 + y_{0\varepsilon}^2)$. From the minimization of negative
depth volume, we obtained $\beta_\varepsilon'$ in terms of $x_{0\varepsilon}'$ and
the depth interval. Substituting for $\beta_\varepsilon'$ into (10), we
obtain $A_\psi$ as a function of $x_{0\varepsilon}'$ and the depth interval.
The negative depth volume for every direction $\psi$ amounts
to $A_\psi l_\psi$, where $l_\psi$ denotes the average extent of the
wedge-shaped negative depth volume in direction $y'$,
and the total negative depth volume is minimized if
$\int A_\psi l_\psi \, d\psi$ is minimized. Considering a limited field
of view, this is achieved if the actual and the estimated
FOE lie on a line passing through the image
center, that is, $\frac{x_0}{y_0} = \frac{\hat{x}_0}{\hat{y}_0}$.

5  Shape  estimation  in  the  presence 
of  distortion 

The above results are of great importance for the
analysis of shape estimation. An error of the form
$\frac{x_0}{y_0} = -\frac{\beta_\varepsilon}{\alpha_\varepsilon} = \frac{\hat{x}_0}{\hat{y}_0}$ guarantees that for the image near
the fixation center, a shape map of the scene is derived
which is very well behaved.

Near the image center the image coordinates are very
small. Thus using (9) the distortion factor there can
be approximated by

$$D \approx \frac{\hat{x}_0 n_x + \hat{y}_0 n_y}{x_0 n_x + y_0 n_y + Z f (\beta_\varepsilon n_x - \alpha_\varepsilon n_y)}$$

If $\frac{x_0}{y_0} = -\frac{\beta_\varepsilon}{\alpha_\varepsilon} = \frac{\hat{x}_0}{\hat{y}_0}$, then for any given $Z$ the numerator is
a multiple of the denominator and thus the distortion
factor is the same for every direction $(n_x, n_y)$.
This means that scene points of the same depth
are distorted by the same factor and the computed
depth map has the same level contours as the actual
depth map of the scene. All the distortion takes
place only in the $Z$ dimension. Thus the resulting
depth function involves an affine transformation.
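
The direction independence of the distortion factor can be verified with a few lines of arithmetic. The sketch below assumes the near-center approximation in the form $D \approx \frac{\hat{x}_0 n_x + \hat{y}_0 n_y}{x_0 n_x + y_0 n_y + Z f (\beta_\varepsilon n_x - \alpha_\varepsilon n_y)}$ with $(n_x, n_y) = (\cos\psi, \sin\psi)$; the function name and the numerical values are hypothetical and chosen only so that $\frac{x_0}{y_0} = -\frac{\beta_\varepsilon}{\alpha_\varepsilon} = \frac{\hat{x}_0}{\hat{y}_0} = 2$.

```python
import math


def distortion_factor(psi, x0, y0, x0_hat, y0_hat, alpha_e, beta_e, f, Z):
    """Near-center approximation of the distortion factor for flow direction psi."""
    nx, ny = math.cos(psi), math.sin(psi)
    numerator = x0_hat * nx + y0_hat * ny
    denominator = x0 * nx + y0 * ny + Z * f * (beta_e * nx - alpha_e * ny)
    return numerator / denominator


# Error configuration satisfying x0/y0 = -beta_e/alpha_e = x0_hat/y0_hat = 2.
params = dict(x0=0.2, y0=0.1, x0_hat=0.3, y0_hat=0.15,
              alpha_e=-0.01, beta_e=0.02, f=1.0, Z=10.0)

# Directions chosen away from tan(psi) = -2, where numerator and denominator
# both vanish and the ratio is undefined.
factors = [distortion_factor(psi, **params) for psi in (0.1, 0.7, 1.3, 2.5)]
```

Every entry of `factors` has the same value (0.75 for these numbers), so all points at depth `Z = 10` are scaled by the same factor regardless of flow direction; a different depth gives a different, but again direction-independent, factor.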

6  Conclusions 

The  inherent  confounding  of  the  translational  and 
rotational  parameters  in  the  problem  of  recon¬ 
struction  from  multiple  views  has  been  analyzed. 
The  results  obtained,  besides  their  potential  use  in 
structure-from-motion  algorithms,  also  represent  a 
computational  analysis  comparing  different  eye  con¬ 
structions  in  the  natural  world  and  different  camera 
designs.  The  results  on  the  sphere  demonstrate  that 
it  is  very  easy  for  a  system  with  panoramic  vision  to 
estimate  its  self-motion.  Indeed,  if  the  system  pos¬ 
sesses  an  inertial  sensor  that  provides  its  rotation 


with  some  error,  we  have  shown  that  after  derota¬ 
tion,  a  simple  algorithm  considering  only  translation 
based  on  normal  flow  will  estimate  the  translation 
optimally.  This  suggests  that  spherical  eye  design 
is  optimal  for  flying  systems  such  as  the  compound 
eyes  of  insects  and  the  panoramic  vision  of  birds. 
The  analysis  on  the  plane  reveals  that  for  an  optimal 
configuration  of  errors,  the  estimated  depth  distorts 
only  in  the  z  direction,  with  the  level  contours  of 
the  depth  function  distorting  by  the  same  amount, 
thus making it feasible to extract meaningful shape
representations.  This  suggests  that  the  camera-type 
eyes  of  primates,  with  high  resolution  near  the  cen¬ 
ter,  are  possibly  optimal  for  systems  that  need  good 
shape  computation  capabilities. 
Abstract

In  this  work,  we  investigate  the  visual  ap¬ 
pearance  of  real-world  surfaces  and  the  depen¬ 
dence  of  appearance  on  imaging  conditions.  We 
present  a  BRDF  (bidirectional  reflectance  dis¬ 
tribution  function)  database  with  reflectance 
measurements  for  over  60  different  samples, 
each  observed  with  over  200  different  combina¬ 
tions  of  viewing  and  source  directions.  We  fit 
the  BRDF  measurements  to  two  recent  models 
to  obtain  a  BRDF  parameter  database.  These 
BRDF  parameters  can  be  directly  used  for  both 
image  analysis  and  image  synthesis.  Finally,  we 
present  a  BTF  (bidirectional  texture  function) 
database  with  image  textures  from  over  60  dif¬ 
ferent  samples,  each  observed  with  over  200  dif¬ 
ferent  combinations  of  viewing  and  source  di¬ 
rections.  Each  of  these  unique  databases  has 
important  implications  for  a  variety  of  vision 
algorithms  and  each  is  made  publicly  available. 

1  Introduction 

Characterizing  the  appearance  of  real-world 
surfaces  is  important  for  many  computer  vision 
algorithms.  The  appearance  of  any  surface  is 
a  function  of  the  scale  at  which  it  is  observed. 
When  the  characteristic  variations  of  the  sur¬ 
face  are  subpixel,  all  local  image  pixels  have 
the  same  intensity  determined  by  the  surface  re¬ 
flectance.  The  variation  of  radiance  with  view¬ 
ing  and  illumination  direction  is  captured  by 
the BRDF (bidirectional reflectance distribution
function).  If  the  characteristic  surface  undula¬ 
tions  are  instead  projected  onto  several  image 
pixels,  there  is  a  local  variation  of  pixel  inten¬ 
sity,  referred  to  as  image  texture.  The  depen¬ 
dency  of  texture  on  viewing  and  illumination 
directions is described by the BTF (bidirectional
texture  function).  This  taxonomy  is  illustrated 
in  Figure  1. 
In this work we measure the BRDF of over 60
samples of rough, real-world surfaces. Although
BRDF models have been widely discussed
and used in vision (see [10],[16],[19],[7],[12]),
the  BRDFs  of  a  large  and  diverse  collec¬ 
tion  of  real-world  surfaces  have  never  before 
been  obtained.  Our  measurements  comprise 
a  comprehensive  BRDF  database  (the  first 
of  its  kind)  that  is  now  publicly  available 
at  www.cs.columbia.edu/CAVE/curet.  Exactly 
how  well  the  BRDFs  of  real-world  surfaces  fit 
existing  models  has  remained  unknown  as  each 
model is typically verified using a small number
(2  to  6)  of  surfaces.  Our  large  database  allows 
us to evaluate the performance of known models.
Specifically, the measurements are fit to two
existing analytical representations: the
Oren-Nayar model [12] for surfaces with isotropic
roughness  and  the  Koenderink  et  al.  decom¬ 
position  [7]  for  both  anisotropic  and  isotropic 
surfaces.  Our  fitting  results  form  a  concise 
BRDF  parameter  database  that  is  also  publicly 
available  at  www.cs.columbia.edu/CAVE/curet. 
These  BRDF  parameters  can  be  directly  used 
for  both  image  analysis  and  image  synthesis.  In 
addition,  the  BRDF  measurements  can  be  used 
to evaluate other existing models [10],[16],[19]
as  well  as  future  models. 

While  obtaining  BRDF  measurements,  images 
of  each  real-world  sample  are  recorded.  These 
images  prove  valuable  since  they  comprise  a 
texture  database,  or  a  BTF  database,  with 
over  12,000  images  (61  samples  with  205  im¬ 
ages  per  sample).  Current  literature  deals  al¬ 
most  exclusively  with  textures  due  to  albedo 
and  color  variations  on  planar  surfaces  (see 
[18],[2],[6]).  In  contrast,  the  texture  due  to 
surface roughness has complex dependencies on
viewing and illumination directions. These dependencies
cannot be studied using existing texture
databases that include few images (often
a single image) of each sample (for instance,
the widely used Brodatz database). Our
texture database covers a diverse collection of
rough surfaces and captures the variation of
image texture with changing illumination and
viewing directions. This database is also available
at www.cs.columbia.edu/CAVE/curet.

Figure 1: Taxonomy of surface appearance. When
viewing and illumination directions are fixed, surface
appearance can be described by either radiance
(at coarse-scale observation) or texture (at fine-scale
observation). When viewing and illumination directions
vary, the equivalent descriptions are the bidirectional
reflectance distribution function (BRDF)
and the bidirectional texture function (BTF).

The  measurements  and  model  fitting  results  are 
pertinent  to  a  variety  of  areas  including  remote¬ 
sensing,  photogrammetry,  image  understanding 
and  scene  rendering.  Important  implications  of 
this  work  for  computer  vision  are  discussed. 

2  Measurement  Methods 

Our measurement device comprises a
robot¹, lamp², personal computer³, spectrometer⁴
and video camera⁵. Measuring the BRDF
requires radiance measurements for a range of
viewing/illumination directions. For each sample
and each combination of illumination and
viewing directions, an image from the video
camera is captured by the frame grabber. These
images have 640x480 pixels with 24 bits per
pixel (8 bits per R/G/B channel). The pixel
values are converted to radiance values using
a post-processing calibration and segmentation
scheme described in [3]. The calibrated, segmented
images serve as the BTF measurements
and these images are averaged to obtain the
BRDF measurements.

¹ SCORBOT-ER V by ESHED Robotec (Tel Aviv, Israel).

² Halogen bulb with a Fresnel lens.

³ IBM compatible PC running Windows 3.1 with "Videomaker"
frame grabber by VITEC Multimedia.

⁴ SpectraScan PR-704 by Photoresearch (Chatsworth, CA).

⁵ Sony DXC-930 3-CCD color video camera.

The  need  to  vary  the  viewing  and  source  di¬ 
rections  over  the  entire  hemisphere  of  possible 
directions  presents  a  practical  obstacle  in  the 
measurements.  This  difficulty  is  reduced  con¬ 
siderably  by  orienting  the  sample  to  generate 
the  varied  conditions.  As  illustrated  in  Fig¬ 
ure  2,  the  light  source  remains  fixed  through¬ 
out  the  measurements.  The  light  rays  incident 
on  the  sample  are  approximately  parallel  and 
uniformly  illuminate  the  sample.  The  camera 
is  mounted  on  a  tripod  and  its  optical  axis  is 
parallel  to  the  floor  of  the  lab.  During  measure¬ 
ments  for  a  given  sample,  the  camera  is  moved 
to  seven  different  locations,  each  separated  by 
22.5  degrees  in  the  ground  plane  at  a  distance  of 
200  cm  from  the  sample.  For  each  camera  posi¬ 
tion,  the  sample  is  oriented  so  that  its  normal  is 
directed  toward  the  vertices  of  the  facets  which 
tessellate  the  fixed  quarter-sphere  illustrated  in 
Figure  2.  With  this  arrangement,  a  considerable 
number  of  measurements  are  made  in  the  plane 
of  incidence  (i.e.  source  direction,  viewing  direc¬ 
tion  and  sample  normal  lie  in  the  same  plane). 
Also,  for  each  camera  position,  a  specular  point 
is  included  where  the  sample  normal  bisects  the 
angle  between  the  viewing  and  source  direction. 
Sample  orientations  with  corresponding  view¬ 
ing  angles  or  illumination  angles  greater  than 
85  degrees  are  excluded  from  the  measurements 
to  avoid  self-occlusion  and  self-shadowing.  This 
exclusion  results  in  the  collection  of  205  images 
for  each  sample.  For  anisotropic  samples,  the 
205  measurements  are  repeated  after  rotating 
the  sample  about  the  global  normal  by  either  90 
degrees  or  45  degrees,  depending  on  the  struc¬ 
ture  of  the  anisotropy. 
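
The measurement geometry described above can be summarized in a short sketch. It assumes a coordinate convention that the paper does not state: the sample at the origin, the fixed light source along the +x axis of the ground plane, and camera position k placed at k times 22.5 degrees from the source direction (consistent with camera position 1 seeing illumination from 22.5 degrees). All names are hypothetical. The sketch lists the seven camera positions at 200 cm and, for each, the specular sample normal that bisects the viewing and source directions.

```python
import math

CAMERA_STEP_DEG = 22.5        # angular separation between camera positions
CAMERA_DISTANCE_CM = 200.0    # camera distance from the sample
NUM_CAMERA_POSITIONS = 7
SOURCE_DIRECTION = (1.0, 0.0, 0.0)   # assumed: source along +x in the ground plane

CAMERA_ANGLES_DEG = [k * CAMERA_STEP_DEG for k in range(1, NUM_CAMERA_POSITIONS + 1)]


def camera_direction(k):
    """Unit viewing direction (sample to camera) for camera position k = 1..7."""
    a = math.radians(k * CAMERA_STEP_DEG)
    return (math.cos(a), math.sin(a), 0.0)


def camera_position(k):
    """Camera location in cm, in the ground plane."""
    return tuple(CAMERA_DISTANCE_CM * c for c in camera_direction(k))


def specular_normal(view_dir, source_dir=SOURCE_DIRECTION):
    """Sample normal that bisects the viewing and source directions."""
    h = tuple(v + s for v, s in zip(view_dir, source_dir))
    n = math.sqrt(sum(c * c for c in h))
    return tuple(c / n for c in h)


def angle_deg(a, b):
    """Angle in degrees between two unit vectors."""
    d = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.degrees(math.acos(d))
```

Under this convention the specular normal for camera position 1 makes an angle of 11.25 degrees with both the source and the viewing direction, and for position 7 an angle of 78.75 degrees, which is still below the 85 degree cutoff mentioned in the text.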

3  Samples  For  Measurements 
The collection of real-world surfaces used in the
measurements is illustrated in Figure 3. Samples
of these surfaces were mounted on 10x12
cm  bases  which  were  constructed  to  fit  onto 
the  robot  gripper.  Each  sample,  though  glob¬ 
ally  planar,  exhibits  considerable  depth  vari¬ 
ation  or  macroscopic  surface  roughness.  The 
samples  were  chosen  to  span  a  wide  range  of  ge¬ 
ometric  and  photometric  properties.  The  cate¬ 
gories  include  specular  surfaces  (aluminum  foil, 
artificial  grass),  diffuse  surfaces  (plaster,  con¬ 
crete),  isotropic  surfaces  (cork,  leather,  styro¬ 
foam),  anisotropic  surfaces  (straw,  corduroy, 
corn  husk),  surfaces  with  large  height  vari¬ 
ations  (crumpled  paper,  terrycloth,  pebbles), 
surfaces with small height variations (sandpaper,
quarry tile, brick), pastel surfaces (paper,
cotton), colored surfaces (velvet, rug), natural
surfaces (moss, lettuce, fur) and man-made surfaces
(sponge, terrycloth, velvet).

Figure 2: Illustration of the discrete sample orientations,
light source and camera positions used in the
measurements. For each of the 7 camera positions
illustrated, the robot orients the sample's global normal
to the directions indicated by the vertices on the
quarter-sphere. The illumination direction remains
fixed.

4  BRDF  Database 

The  BRDF  measurements  form  a  database  with 
over  12,000  reflectance  measurements  (61  sam¬ 
ples,  205  measurements  per  sample,  205  addi¬ 
tional  measurements  for  anisotropic  samples). 
The  measured  BRDFs  are  quite  diverse  and  re¬ 
veal  the  complex  appearance  of  many  ordinary 
surfaces. 

Figure  4  illustrates  examples  of  spheres  ren¬ 
dered  with  the  measured  BRDF  as  seen  from 
camera  position  1,  i.e.  with  illumination  from 
22.5°  to  the  right.  Interpolation  is  used  to  ob¬ 
tain  a  continuous  radiance  pattern  over  each 
sphere.  The  rendered  sphere  corresponding  to 
velvet  (Sample  7)  shows  a  particularly  interest¬ 
ing  BRDF  that  has  bright  regions  when  the 
global  surface  normal  is  close  to  90  degrees 
from  the  illumination  direction.  This  effect 
can  be  accounted  for  by  considering  the  indi¬ 
vidual  strands  comprising  the  velvet  structure 
which  reflect  light  strongly  as  the  illumination 
becomes  oblique.  This  effect  is  consistent  with 
the  observed  brightness  in  the  interiors  of  folds 
of  a  velvet  sheet.  Indeed,  the  rendered  velvet 
sphere  gives  a  convincing  impression  of  velvet. 
The rendered spheres of plaster (Sample 30)
and roofing shingle (Sample 32) show a fairly
flat appearance which is quite different from
the Lambertian prediction for such matte objects,
but is consistent with [11] and [12]. Concrete
(Sample 49) and salt crystals (Sample 43)
also show a somewhat flat appearance, while
rough paper (Sample 31) is more Lambertian.
The plush rug (Sample 19) and moss (Sample 61)
have similar reflectance patterns as one
would expect from the similarities of their geometry.
Rendered spheres from two anisotropic
samples of wood (Sample 54 and Sample 56)
are also illustrated in Figure 4. The structure
of the anisotropy of sample 54 consists of
horizontally oriented ridges. This ridge structure
causes a vertical bright stripe instead of a
specular lobe in the rendered sphere. Sample
56 shows a similar effect, but the anisotropic
structure for this sample consists of near vertical
ridges. Consequently the corresponding rendered
sphere shows a horizontal bright region
due to the surface geometry.

Figure 4: Spheres rendered using the BRDF measurements
obtained from camera position 1 (illumination
at 22.5° to the right). Interpolation was
used to obtain radiance values between the measured
points.
[Figure 4 panels: 29-Polyester, 32-Roofing Shingle, 31-Rough Paper, 36-Limestone; 43-Salt Crystals, 49-Concrete, 30-Plaster, 55-Orange Peel; 7-Velvet, 15-Aluminum Foil, 19-Rug, 27-Insulation; 41-Brick, 54-Wood_a, 56-Wood_b, 61-Moss]
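
The flat, non-Lambertian appearance of rough matte samples noted above is the kind of behavior the Oren-Nayar model [12] was designed to capture. The sketch below uses the widely quoted simplified (qualitative) form of that model with a single roughness parameter sigma (in radians); whether this is the exact variant fitted in this work is not stated in this section, so it should be read as an illustration only. The function names and default values are assumptions.

```python
import math


def lambertian(theta_i, albedo=1.0, irradiance=1.0):
    """Radiance of an ideal Lambertian surface for incidence angle theta_i."""
    return albedo / math.pi * irradiance * math.cos(theta_i)


def oren_nayar(theta_i, theta_r, phi_diff, sigma, albedo=1.0, irradiance=1.0):
    """Simplified Oren-Nayar radiance for a rough diffuse surface.

    theta_i, theta_r: polar angles of the source and viewing directions,
    phi_diff: difference of their azimuth angles,
    sigma: surface roughness (standard deviation of facet slope angle, radians).
    """
    s2 = sigma * sigma
    a = 1.0 - 0.5 * s2 / (s2 + 0.33)
    b = 0.45 * s2 / (s2 + 0.09)
    alpha = max(theta_i, theta_r)
    beta = min(theta_i, theta_r)
    return (albedo / math.pi * irradiance * math.cos(theta_i)
            * (a + b * max(0.0, math.cos(phi_diff)) * math.sin(alpha) * math.tan(beta)))
```

With `sigma = 0` the expression reduces to the Lambertian case. For a rough surface viewed from the source direction at an oblique angle, the model predicts more radiance than Lambert's law, and for a frontal view less, which flattens the shading across a sphere.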

5  Fitting  to  BRDF  Models 

A  concise  description  is  required  for  functional 
utility  of  the  measurements.  We  employ  the 
Oren-Nayar  model  [12]  and  the  Koenderink  et 
al.  representation  [S]  to  obtain  parametric  de¬ 
scriptions  of  the  BRDF  measurement  database. 
The  resulting  database  of  parameters  can  be 
used  directly  and  conveniently  in  a  variety  of  al¬ 
gorithms  where  accurate,  concise  and  analytical 
reflectance  descriptions  are  needed.  In  vision, 
these  applications  include  shape-from-shading 
and  photometric  stereo.  In  computer  graphics, 
the reflectance parameters are useful for realistic
rendering of natural surfaces. As with the

[Figure 3 panel labels recovered from the image: 6-Sandpaper, 7-Velvet, 44-Linen, 47-Aquarium Stones, 48-Brown Bread, 51-Corn Husk, 52-White Bread, 53-Soleirolia Plant, 54-Wood_a, 55-Orange Peel, 56-Wood_b]

Figure 3: The collection of 61 real-world surfaces used in the measurements. The name and number of
each sample is indicated above its image. The samples were chosen to span a wide range of geometric and
photometric properties. The categories include specular surfaces (aluminum foil, artificial grass), diffuse
surfaces (plaster, concrete), isotropic surfaces (cork, leather, styrofoam), anisotropic surfaces (straw, corduroy,
corn husk), surfaces with large height variations (crumpled paper, terrycloth, pebbles), surfaces with small
height variations (sandpaper, quarry tile, brick), pastel surfaces (paper, cotton), colored surfaces (velvet,
rug), natural surfaces (moss, lettuce, fur) and man-made surfaces (sponge, terrycloth, velvet). Different
samples of the same type of surface are denoted by letters, e.g. Brick_a and Brick_b. Samples 29, 30, 31
and 32 are close-up views of samples 2, 11, 12 and 14, respectively.

Figure 5: Cylinder rendered with 2D texture mapping
(left) and 3D texture mapping (right). The 2D
texture mapping was done by warping a frontal view
image of the texture (with illumination at 22.5
degrees to the right). The 3D texture mapping uses 13
images from the BTF of Sample 45 (concrete).

measurement  database,  the  complete  database 
of  reflectance  parameters  is  also  available  elec¬ 
tronically. 
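
As a point of reference for the model-fitting step, the sketch below evaluates the widely used qualitative (closed-form) approximation of the Oren-Nayar model and compares it with the Lambertian prediction. It is an illustration only: the function names, argument conventions and the default irradiance `e0` are assumptions of this sketch, and the text above does not state whether the fits used this approximation or the full model.

```python
import math

def oren_nayar_radiance(theta_i, theta_r, dphi, rho, sigma, e0=1.0):
    """Qualitative Oren-Nayar approximation (illustrative sketch).

    theta_i, theta_r: source and viewing polar angles in radians.
    dphi: azimuth difference (phi_r - phi_i) in radians.
    rho: albedo; sigma: roughness (std. dev. of facet slope, radians).
    e0: source irradiance at normal incidence (hypothetical default).
    """
    s2 = sigma * sigma
    a = 1.0 - 0.5 * s2 / (s2 + 0.33)
    b = 0.45 * s2 / (s2 + 0.09)
    alpha = max(theta_i, theta_r)
    beta = min(theta_i, theta_r)
    return (rho / math.pi) * e0 * math.cos(theta_i) * (
        a + b * max(0.0, math.cos(dphi)) * math.sin(alpha) * math.tan(beta)
    )

def lambertian_radiance(theta_i, rho, e0=1.0):
    """Lambertian prediction: radiance is independent of viewing direction."""
    return (rho / math.pi) * e0 * math.cos(theta_i)
```

With `sigma = 0` the expression reduces to the Lambertian case; for a rough surface viewed from the source direction it predicts more radiance than Lambert's law, which is the kind of flattening discussed for the rendered spheres.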

6  Texture  Database 

The  appearance  of  a  rough  surface,  whether 
manifested  as  a  single  radiance  value  or  as  image 
texture,  depends  on  viewing  and  source  direc¬ 
tion.  Just  as  the  BRDF  describes  the  coarse- 
scale  appearance  of  a  rough  surface,  the  BTF 
(bidirectional  texture  function)  is  useful  for  de¬ 
scribing  the  fine-scale  appearance  of  a  rough 
surface.  Our  measurements  of  image  texture 
comprise  the  first  BTF  database  for  real-world 
surfaces.  The  database  has  over  12,000  images 
(61  samples,  205  measurements  per  sample,  205 
additional measurements for anisotropic samples).

Important  observations  on  the  BTF  can  be 
made  from  the  database.  Consider  texture  map¬ 
ping  using  Sample  45  (concrete)  as  shown  in 
Figure  5.  The  differences  in  the  2D  texture- 
mapped  cylinder  and  the  3D  texture-mapped 
cylinder  (using  the  database  images)  are  readily 
apparent.  Because  of  the  varying  surface  nor¬ 
mals  across  the  sample,  foreshortening  effects 
are  quite  complicated  and  cannot  be  accounted 
for by common texture-mapping techniques. A
detailed discussion of the pitfalls of current
texture rendering schemes is given in [8].

Consider  the  same  sample  shown  under  two  dif¬ 
ferent  sets  of  illumination  and  viewing  direc¬ 
tions  in  Figure  6.  The  corresponding  Fourier 
spectra  are  also  shown  in  Figure  6.  Notice  that 
the  spectra  are  quite  different.  Most  of  the  dif¬ 
ference  is  due  to  the  change  in  azimuthal  angle 
of  the  source  direction  which  causes  a  change 
in  the  shadowing  direction  and  hence  a  change 
in  the  dominant  orientation  of  the  spectrum.  If 
the image texture were due to a planar albedo


Figure  6:  Changes  in  the  spectrum  due  to  changes 
in  imaging  conditions.  (Top  row)  Two  images  of 
sample  11  with  different  source  and  viewing  direc¬ 
tions.  (Bottom  row)  Fourier  spectrum  of  the  images 
in  the  top  row,  with  zero  frequency  at  the  center 
and  brighter  regions  corresponding  to  higher  mag¬ 
nitudes.  The  orientation  change  in  the  spectrum  is 
due  to  the  change  of  source  direction  which  causes  a 
change  in  the  shadow  direction. 

or  color  variation,  changes  in  the  source  direc¬ 
tion  would  not  have  this  type  of  effect  on  the 
spectrum.  Source  direction  changes  would  only 
cause  a  uniform  scaling  of  the  intensity  over  the 
entire  image. 
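
The claim about planar albedo textures can be checked numerically. The sketch below assumes that, for a flat albedo pattern, a change of source direction acts as a single multiplicative factor on the whole image; it then confirms that the normalized Fourier magnitude spectrum is unchanged. The 8 x 8 stripe pattern and the scale factor are hypothetical, and a naive DFT is used only to keep the example free of external libraries.

```python
import cmath

def dft2_magnitude(img):
    """Magnitude of the 2-D DFT of a small image given as a list of rows."""
    rows, cols = len(img), len(img[0])
    mag = []
    for u in range(rows):
        out_row = []
        for v in range(cols):
            acc = 0j
            for m in range(rows):
                for n in range(cols):
                    angle = -2.0 * cmath.pi * (u * m / rows + v * n / cols)
                    acc += img[m][n] * cmath.exp(1j * angle)
            out_row.append(abs(acc))
        mag.append(out_row)
    return mag

def normalized(mag):
    """Scale a magnitude spectrum so that its entries sum to one."""
    total = sum(sum(row) for row in mag)
    return [[value / total for value in row] for row in mag]

# Hypothetical planar albedo texture: vertical stripes on an 8 x 8 patch.
albedo = [[1.0 if (n // 2) % 2 == 0 else 0.2 for n in range(8)] for _ in range(8)]
# A change of source direction modeled as a uniform intensity scaling.
dimmer = [[0.35 * value for value in row] for row in albedo]

spec_a = normalized(dft2_magnitude(albedo))
spec_b = normalized(dft2_magnitude(dimmer))
max_diff = max(abs(a - b) for ra, rb in zip(spec_a, spec_b) for a, b in zip(ra, rb))
```

For a rough surface the shadowing pattern itself changes with the source azimuth, so no single scale factor relates the two images and the spectra differ in orientation, as in Figure 6.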

7  Implications  for  Vision 

Our  BRDF  measurement  database  provides  a 
thorough  investigation  of  the  reflectance  proper¬ 
ties  of  real-world  rough  surfaces.  This  database 
fills  a  long-standing  need  for  a  benchmark  to 
test  and  compare  BRDF  models  as  we  have 
done  here  for  the  Oren-Nayar  model  and  the 
Koenderink  et  al.  decomposition. 

Our  BRDF  parameter  database,  obtained  by 
fitting  the  measurements  to  the  Oren-Nayar 
model  and  the  Koenderink  et  al.  decomposi¬ 
tion,  can  be  used  in  place  of  the  popular  Lam¬ 
bertian  reflectance  model  in  such  algorithms  as 
shape-from-shading  [5]  and  photometric  stereo 
[20].  Since  these  algorithms  rely  on  a  reflectance 
model  to  ascertain  shape,  inaccuracies  of  the 
Lambertian  model  can  significantly  affect  their 
performance.  The  model  parameters  can  also  be 
used instead of popular shading models [4],[1]
for  photorealistic  rendering  of  real-world  sur¬ 
faces. 

Since  the  parameter  database  covers  two  BRDF 
representations,  a  choice  can  be  made  to  balance 
accuracy  and  conciseness.  For  isotropic  surfaces 
the  3-parameter  Oren-Nayar  model  can  be  em¬ 
ployed.  For  isotropic  and  anisotropic  surfaces, 
when a richer description can be afforded, the
55-parameter Koenderink et al. model can be used.
For the 61 surfaces we have investigated, the
parameters for both models are readily available.

Our  BTF  database  is  the  first  comprehen¬ 
sive  investigation  of  texture  appearance  as  a 
function  of  viewing  and  illumination  direction. 
As  illustrated  in  Figure  5  and  Figure  6,  sur¬ 
face  roughness  causes  notable  effects  on  the 
BTF which are not considered by current texture
algorithms. Present algorithms for
shape-from-texture [13],[15],[9], texture segmentation
[21],[9] and texture recognition [14] are only
suitable  for  2D  textures,  i.e.  planar  texture  due 
to  albedo  variation.  Texture  rendering  also 
typically  assumes  a  2D  planar  texture  that  is 
mapped  to  a  3D  surface.  When  the  surface  is 
rough,  the  rendering  tends  to  be  too  flat  and 
unrealistic.  Texture  analysis  and  synthesis  of 
real-world  rough  surfaces  remains  an  important 
unsolved  problem.  The  database  illustrates  the 
need  for  3D  texture  algorithms  and  serves  as  a 
starting  point  for  their  exploration. 

Our  BRDF  measurement  database,  BRDF 
model  parameter  database  and  BTF  measure¬ 
ment  database  together  represent  an  extensive 
investigation  of  the  appearance  of  real-world 
surfaces. Each of these databases has important
implications for computer vision.

Abstract

We  propose  an  algorithm  to  automatically  con¬ 
struct  feature  detectors  for  arbitrary  parametric 
features.  In  the  algorithm,  each  feature  is  repre¬ 
sented  as  a  densely  sampled  parametric  manifold 
in a low dimensional subspace of $\Re^N$. Detection
is  performed  by  projecting  the  brightness  dis¬ 
tribution  around  each  image  pixel  into  the  sub¬ 
space.  If  the  projection  lies  sufficiently  close  to 
the  feature  manifold,  the  feature  is  detected  and 
the  location  of  the  closest  point  on  the  mani¬ 
fold  is  used  to  estimate  the  feature  parameters. 
By  applying  the  algorithm  to  appropriate  fea¬ 
ture  models,  detectors  have  been  constructed  for 
five  parametric  features,  namely,  step  edge,  roof 
edge,  line,  corner,  and  circular  disc. 

1  Introduction 

Many  applications  in  computational  vision  rely 
upon  robust  detection  of  image  features  and  ac¬ 
curate  estimation  of  their  parameters.  Although 
the  standard  example  of  such  a  feature  is  the 
step  edge,  it  is  by  no  means  the  only  feature  of 
interest.  A  comprehensive  list  would  also  include 
lines, corners, junctions, and roof edges,¹ as well
as  numerous  others.  In  short,  features  may  be 
too  numerous  to  justify  the  process  of  deriving  a 
new  detector  for  each  one.  Our  aim  in  this  pa¬ 
per  is  to  develop  a  single  detection  mechanism 
that  can  be  applied  to  any  parametric  feature. 
Moreover,  we  wish  to  obtain  precise  estimates  of 
feature  parameters,  which  if  recovered  with  pre¬ 
cision  can  be  of  vital  importance  to  higher  levels 
of  visual  processing. 

To  obtain  high  performance  in  both  feature  de¬ 
tection  and  parameter  estimation,  it  is  essential 

*This research was supported in part by ARPA
Contract  DACA-76-92-C-007,  DOD/ONR  MURI  Grant 
N00014-95-1-0601,  an  NSF  National  Young  Investigator 
Award,  and  the  NTT  Basic  Research  Laboratory. 

¹Given the extent to which feature detection has been
explored, a survey of the work in this area is well
beyond the scope of this paper. In our discussion, we only
use  examples  of  previous  detectors  without  attempting 
to  mention  all  of  them.  Further,  we  will  primarily  be  in¬ 
terested  in  examples  that  use  parametric  feature  models 
rather  than  those  based  upon  differential  invariants. 


to  accurately  model  the  features  as  they  appear 
in  the  physical  world.  Hence,  we  choose  not  to 
make  any  simplifying  assumptions  for  analytic  or 
efficiency  reasons,  and  instead  use  realistic  multi¬ 
parameter  feature  models.  Further,  we  give  care¬ 
ful  consideration  to  the  conversion  of  the  contin¬ 
uous  radiance  function  of  the  feature  in  the  world 
to  its  discrete  image. 

Given  a  parametric  model  of  a  feature  and  a 
model  of  the  imaging  system,  we  can  accurately 
predict  the  pixel  brightness  values  in  a  window 
about  an  imaged  feature.  If  we  regard  the  pixel 
brightness  values  as  real  numbers,  we  can  treat 
each  feature  as  corresponding  to  a  parametric 
manifold in $\Re^N$, where $N$ is the number of
pixels in the window surrounding the feature.
Feature detection is then posed as finding the closest
point on the manifold to the point in $\Re^N$
corresponding to the pixel brightness values in a novel
image  window.  If  the  closest  manifold  point  is 
near  enough  to  the  novel  point,  we  detect  the  fea¬ 
ture  and  the  exact  location  (parameters)  of  the 
closest  manifold  point  may  be  used  as  estimates 
of  the  parameters  of  the  feature.  This  statement 
of the feature detection problem was first
introduced by Hueckel [1971] and was subsequently
used  by  Hummel  [1979]  amongst  others. 

Hueckel  and  Hummel  both  argued  that,  in  order 
to  achieve  high  efficiency,  a  closed  form  solution 
must  be  found  for  (the  parameters  of)  the  closest 
manifold  point.  To  make  their  derivations  pos¬ 
sible  they  used  simplified  feature  models.  Our 
view  of  feature  detection  is  radically  different. 
We  argue  that  the  features  we  wish  to  detect  are 
inherently  complex  visual  entities  and  so  give  up 
all  hope  of  finding  closed-form  solutions  for  the 
best-fit  parameters.  Instead,  we  discretize  the 
search  problem  by  densely  sampling  the  feature 
manifold. 

At  first  glance,  finding  the  closest  sample  point 
may  seem  inefficient  to  the  point  of  impractical- 
ity.  However,  we  will  demonstrate  that  our  ap¬ 
proach  is  very  practical  through  a  combination 
of normalization, dimension reduction [Nayar et
al., 1996], efficient heuristic search [Baker et al.,
1998], and rejection techniques [Baker and Nayar,
1996b]. Even in the present unoptimized
implementation,  feature  detection  and  parame¬ 
ter  estimation  take  only  a  few  seconds  on  a  stan¬ 
dard  single-processor  workstation  when  applied 
to a 512 × 480 image.

2  Parametric  Feature  Representation 

2.1  Parametric  Scene  Features 

By  a  scene  feature  we  mean  a  geometric  or  pho¬ 
tometric  phenomenon  that  produces  spatial  ra¬ 
diance  variations  which  can  aid  in  visual  percep¬ 
tion.  The  continuous  radiance  function  of  the 
scene feature can be written as $F^c(x, y; \mathbf{q})$, where
$(x, y) \in S$ are points within a feature window $S$
and  q  are  the  parameters  of  the  feature. 

2.2  Image  Formation  and  Sensing 

Previous  work  on  feature  detection  has  implic¬ 
itly  assumed  that  artifacts  induced  by  the  imag¬ 
ing  system  are  negligible  and  can  be  ignored. 
We  make  our  models  as  precise  as  possible  by 
incorporating  these  effects.  One  such  effect  is 
defocus.  Another  is  that  the  finite  size  of  the 
lens  aperture  causes  the  optical  transfer  func¬ 
tion  to  be  spatially  bandlimited.  Also,  the  fea¬ 
ture  itself,  even  before  imaging,  may  be  some¬ 
what  smoothed  or  rounded.  The  defocus  factor 
can  be  approximated  as  a  pillbox  function  [Born 
and  Wolf,  1965],  the  optical  transfer  function  by 
the  square  of  the  first-order  Bessel  function  of 
the  first  kind  [Born  and  Wolf,  1965],  and  the 
blurring  due  to  imperfections  in  the  feature  by  a 
Gaussian  function  [Koenderink,  1984].  We  com¬ 
bine  all  three  effects  into  a  single  blurring  factor 
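that is assumed to be a 2-D Gaussian function:

$$ g(x, y; \sigma) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{x^2 + y^2}{2\sigma^2} \right) \qquad (1) $$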

The continuous image on the sensor plane is
converted to a discrete image through two processes.
First, the light flux falling within each pixel is
integrated. If the pixels are rectangular in
structure [Barbe, 1980] [Norton, 1982], the averaging
function is:

$$ a(x, y) = \frac{1}{w_x w_y} \, {}^{2}\Pi\!\left( \frac{1}{w_x} x, \frac{1}{w_y} y \right) \qquad (2) $$

where $w_x$ and $w_y$ are the dimensions of the pixel.
Next, the pixels are sampled, which we model by
the rectangular grid:

$$ s(x, y) = {}^{2}\mathrm{III}\!\left( \frac{1}{p_x} x, \frac{1}{p_y} y \right) \qquad (3) $$

where $p_x$ and $p_y$ are the spacings between
samples. The final discrete image of a feature may
then be written as:

$$ F(x, y; \mathbf{q}) = \{ F^c(x, y; \mathbf{q}) * g(x, y) * a(x, y) \} \cdot s(x, y) \qquad (4) $$

where $*$ is the 2-D convolution operator. Since
the above is a weighted sum of Dirac delta
functions, it can be rewritten as $F(m, n; \mathbf{q})$, where
$(m, n) \in S$ are the (integral) pixel coordinates.

2.3 Parametric Feature Manifolds

If the number of pixels $(m, n)$ in the window $S$ is
$N$, each feature instance $F(m, n; \mathbf{q})$ may be
regarded as a point in $\Re^N$. Suppose the feature has
$k$ parameters: $\dim(\mathbf{q}) = k$. Then, as the
parameters vary over their ranges, $F(m, n; \mathbf{q})$ traces
out a $k$-parameter manifold. Feature detection
is then posed as finding the closest point on the
feature manifold to the point in $\Re^N$
corresponding to each window in the image. If the manifold
is near enough, we detect the feature and the
location (parameters) of the closest manifold point
provides an estimate of the feature parameters.

2.4 Parameter Normalization

For each feature instance $F(m, n; \mathbf{q})$
encountered, we compute its mean pixel value
$\mu(\mathbf{q}) = \frac{1}{N} \sum_{(m,n) \in S} F(m, n; \mathbf{q})$
and its pixel variance
$\nu(\mathbf{q}) = \| F(m, n; \mathbf{q}) - \mu(\mathbf{q}) \|$. We then apply
the following brightness normalization:

$$ \bar{F}(m, n; \mathbf{q}) = \frac{1}{\nu(\mathbf{q})} \left[ F(m, n; \mathbf{q}) - \mu(\mathbf{q}) \right] \qquad (5) $$

For all of the features we have considered, the
above normalization reduces the dimensionality
of the feature manifold by two. This happens
because $\bar{F}(m, n; \mathbf{q})$ is (approximately) independent
of two of the parameters in $\mathbf{q}$. Once a feature has
been detected, $\mu$ and $\nu$ can be used to recover the
two normalized parameters [Baker et al., 1998].

2.5 Dimension Reduction

For several reasons, such as feature symmetries
and high correlation between feature instances
with similar parameter values, it is possible
to represent the feature manifold in a
low-dimensional subspace of $\Re^N$ without significant
loss of information.² If correlation between feature
instances is the preferred measure of similarity,
the Karhunen-Loeve (K-L) expansion [Fukunaga,
1990] yields the optimal subspace.

²This idea was first explored in [Hummel, 1979].
Whereas Hummel derived closed-form solutions based
upon simplistic feature models, our approach is to use
elaborate feature models and numerical methods. This
results in higher precision and greater generality. A
similar approach has been adopted in [Nandy et al., 1996].

3  Example  Features 

For  lack  of  space,  we  now  illustrate  the  paramet¬ 
ric  manifold  representations  for  only  1  of  the  5 
features  which  we  constructed  detectors  for.  The 
results  for  the  other  features  are  similar  and  may 
be found in [Baker et al., 1998].

3.1  Step  Edge 

Figures  1(a)  and  1(b)  show  isometric  and  plan 
views  of  our  step  edge  model.  It  is  a  general¬ 
ization  of  the  models  used  in  [Hueckel,  1971], 
[Hummel,  1979],  and  [Lenz,  1987].  It  is  particu¬ 
larly  similar  to  the  model  of  [Nalwa  and  Binford, 
1986], differing only slightly in its treatment of
smoothing  effects. 

The  basis  for  the  2-D  step  edge  model  is  the  1-D 
unit  step  function: 


$$ u(t) = \begin{cases} 1 & \text{if } t \geq 0 \\ 0 & \text{if } t < 0 \end{cases} \qquad (6) $$


A step with lower intensity level $A$ and upper
intensity level $A + B$ can be written as $A + B \cdot u(t)$.
To extend to 2-D, we assume that the step edge is
of constant cross section, is oriented at angle $\theta$ to
the x-axis, and lies at distance $\rho$ from the origin.
Then, the perpendicular distance of an arbitrary
2-D point $(x, y)$ from the step is given by:


$$ z = y \cos\theta - x \sin\theta - \rho \qquad (7) $$

Therefore, an ideal step edge of arbitrary
orientation and displacement from the origin is given
by the 2-D function $A + B \cdot u(z)$. For the reasons
given in Section 2.2 we incorporate the Gaussian
blurring function, the pixel averaging function,
and the sampling function. Finally, the step edge
model is:

$$ F_{SE}(x, y; A, B, \theta, \rho, \sigma) = \{ (A + B \cdot u(z)) * g(x, y; \sigma) * a(x, y) \} \cdot s(x, y) \qquad (8) $$

where $z$ is given by Equation (7).
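
To make the step edge model concrete, the following sketch synthesizes a small window of pixel values. It relies on a standard fact: convolving an ideal step with a unit-area Gaussian gives the Gaussian cumulative distribution across the edge, which can be written with `math.erf`. Pixel averaging is approximated by sub-sampling each pixel. The square window, unit pixel size, unit sample spacing, and the names `blurred_step`, `step_edge_window`, `half` and `sub` are assumptions of this illustration, not details taken from the paper.

```python
import math

def blurred_step(z, a, b, sigma):
    """Ideal step A + B*u(z) convolved with a unit-area Gaussian of width sigma."""
    return a + b * 0.5 * (1.0 + math.erf(z / (sigma * math.sqrt(2.0))))

def step_edge_window(a, b, theta_deg, rho, sigma, half=3, sub=4):
    """Sample a blurred, pixel-averaged step edge at integer pixel centres.

    half: window half-width in pixels (square window, hypothetical choice).
    sub: sub-samples per pixel side used to approximate the pixel integral.
    """
    theta = math.radians(theta_deg)
    c, s = math.cos(theta), math.sin(theta)
    window = []
    for n in range(-half, half + 1):          # y (row) coordinate
        row = []
        for m in range(-half, half + 1):      # x (column) coordinate
            acc = 0.0
            for j in range(sub):
                for i in range(sub):
                    x = m + (i + 0.5) / sub - 0.5
                    y = n + (j + 0.5) / sub - 0.5
                    z = y * c - x * s - rho   # perpendicular distance, Equation (7)
                    acc += blurred_step(z, a, b, sigma)
            row.append(acc / (sub * sub))
        window.append(row)
    return window

win = step_edge_window(a=10.0, b=40.0, theta_deg=0.0, rho=0.0, sigma=0.6)
```

With the edge along the x-axis through the window centre, rows well below the edge approach A, rows well above approach A + B, and the centre row averages to A + B/2.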

The step edge model has 5 parameters, namely,
orientation $\theta$, localization $\rho$, blurring or
scaling $\sigma$, and the brightness values $A$ and $B$. The
orientation parameter $\theta$ is drawn from $[0^\circ, 360^\circ]$.
We restrict the localization parameter $\rho$ to lie in
$[-1/\sqrt{2}, 1/\sqrt{2}]$, since any edge must pass closer
than $1/\sqrt{2}$ pixels from the center of at least

[Figure 1 panels recovered from the image: (c) First 8 eigenvectors; (d) Decay of the K-L residue; (e) Step edge parametric manifold]

Figure 1: The step edge model includes two constant
intensity regions of brightness $A$ and $A + B$. Its
orientation and intrapixel displacement from the
origin are given by the parameters $\theta$ and $\rho$
respectively. The fifth parameter (not shown) is the
blurring factor $\sigma$. The K-L residue plot shows that
90% of the edge image content is preserved by the
first 3 eigenvectors. The step edge manifold is
parameterized by orientation and intrapixel
localization for a fixed blurring value and is
displayed in a 3-D subspace.




one pixel in the image. The blurring parameter
$\sigma$ lies in $[0.3, 1.5]$. The intensity parameters $A$ and $B$
are free to take any value because of the
normalization described in Section 2.4. The structure of
a normalized step edge is independent of $A$ and $B$
and is uniquely determined by the parameters $\theta$,
$\rho$, and $\sigma$. Further, the values of $A$ and $B$ may be
recovered from the values of $\mu$ and $\nu$ calculated
during normalization [Baker et al., 1998].
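
The independence from A and B can be seen in a few lines. The sketch assumes the normalization of Section 2.4 subtracts the mean and divides by the norm of the residual, and it uses a hypothetical 1-D edge profile in place of a real window. The recovery of A and B at the end is one straightforward way to invert the normalization when the unit profile is known; it is not necessarily the procedure of [Baker et al., 1998].

```python
import math

def normalize(window):
    """Subtract the mean and divide by the norm of the residual."""
    n = len(window)
    mu = sum(window) / n
    centered = [v - mu for v in window]
    nu = math.sqrt(sum(c * c for c in centered))
    if nu == 0.0:
        raise ValueError("constant window cannot be normalized")
    return [c / nu for c in centered], mu, nu

# Hypothetical 1-D edge profile with values in [0, 1] (a blurred unit step).
profile = [0.0, 0.02, 0.2, 0.5, 0.8, 0.98, 1.0]

def edge(a, b):
    """Edge with lower level a and step size b built from the unit profile."""
    return [a + b * p for p in profile]

shape_1, mu_1, nu_1 = normalize(edge(10.0, 40.0))
shape_2, mu_2, nu_2 = normalize(edge(-3.0, 7.5))

# Recover A and B for the first edge from (mu, nu) and the unit profile.
_, mu_p, nu_p = normalize(profile)
b_rec = nu_1 / nu_p
a_rec = mu_1 - b_rec * mu_p
```

Both normalized windows coincide (for positive step size B), so only the remaining parameters need to be searched over, while the mean and norm carry the information about A and B.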

The  window  chosen  for  our  edge  model  is  a  49 
pixel  disc  to  avoid  unnecessary  non-linearities  in¬ 
duced  by  a  square  window.  The  results  of  apply¬ 
ing  the  Karhunen-Loeve  expansion  are  displayed 
in  Figures  1(c)  and  1(d).  In  Figure  1(c)  we  dis¬ 
play  the  8  most  important  eigenvectors,  ranked 
by  their  eigenvalues.  The  similarity  between  the 
first  4  eigenvectors  and  the  ones  derived  in  [Hum¬ 
mel,  1979]  is  immediate.  On  closer  inspection, 
however,  we  notice  that  while  Hummel’s  eigen¬ 
vectors  are  radially  symmetric,  the  ones  we  com¬ 
puted  are  not.  This  is  to  be  expected  since  the 
introduction of the parameters $\rho$ and $\sigma$ breaks
the  radial  symmetry  in  Hummel’s  edge  model. 

In  Figure  1(d),  the  decay  of  the  Karhunen-Loeve 
residue  (sum  of  eigenvalues  discarded)  is  plotted 
as  a  function  of  the  number  of  eigenvectors.  To 
reduce  the  residue  to  10%  we  need  to  use  3  eigen¬ 
vectors.  To  reduce  it  further  to  2%  we  need  8 
eigenvectors.  Figure  1(d)  illustrates  a  significant 
data  compression  factor  of  5-15  times.  As  a  re¬ 
sult,  feature  detection  is  made  far  more  efficient. 
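
The residue computation itself is simple once the eigenvalues are known. The sketch below uses a hypothetical, geometrically decaying spectrum for a 49-pixel window (illustrative numbers, not the measured eigenvalues of Figure 1(d)) and finds the smallest number of eigenvectors that meets a target residue. The helper names are assumptions of this example.

```python
def kl_residue(eigenvalues, k):
    """Fraction of the total eigenvalue mass discarded when keeping the top k."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return sum(ordered[k:]) / total

def smallest_k(eigenvalues, target):
    """Smallest k whose residue is at or below the target fraction."""
    for k in range(len(eigenvalues) + 1):
        if kl_residue(eigenvalues, k) <= target:
            return k
    return len(eigenvalues)

# Hypothetical, rapidly decaying spectrum for a 49-pixel window.
spectrum = [0.5 ** i for i in range(49)]

k10 = smallest_k(spectrum, 0.10)   # eigenvectors needed for a 10% residue
k02 = smallest_k(spectrum, 0.02)   # eigenvectors needed for a 2% residue

# Ratio of window size to subspace dimension for 10-D and 3-D subspaces.
compression = (49 / 10, 49 / 3)
```

One plausible reading of the quoted compression factor is the ratio of the 49 window pixels to the number of retained coefficients, which gives roughly 5 for a 10-D subspace and roughly 16 for a 3-D subspace.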

The  step  edge  manifold  is  displayed  in  Fig¬ 
ure  1(e).  Naturally,  we  are  only  able  to  display 
a  projection  of  it  into  a  3-D  subspace.  This  sub¬ 
space  is  the  one  spanned  by  the  3  most  important 
eigenvectors. For clarity, we only display a
2-parameter slice through the manifold, obtained by
keeping $\sigma$ constant while varying $\theta$ and $\rho$.

4  Feature  Detection 

Given a point in $\Re^N$ corresponding to the pixel
intensity  values  in  a  novel  feature  window,  fea¬ 
ture  detection  requires  finding  the  closest  point 
on  the  parametric  manifold.  If  the  distance  be¬ 
tween  the  novel  point  and  the  closest  manifold 
point  is  sufficiently  small,  we  declare  the  pres¬ 
ence  of  the  feature.  The  parameters  of  the  clos¬ 
est  manifold  point  are  then  used  as  estimates  of 
the  scene  feature’s  parameters.  If  the  distance 
between  the  novel  point  and  the  manifold  is  too 
large,  we  assert  the  absence  of  the  feature. 

We approximate the closest manifold point by
densely sampling the manifold and then
performing a search for the closest sample point. So
long as we sample densely enough, this yields a
sufficiently good estimate of the closest manifold
point. We search using a heuristic coarse-to-fine
search which takes advantage of the relatively
smooth manifolds [Baker et al., 1998].
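
The detection rule can be sketched with an exhaustive search standing in for the coarse-to-fine heuristic, which is not specified in enough detail here to reproduce. The example uses a hypothetical one-parameter manifold (a circle sampled every 5 degrees in a 2-D subspace); `nearest_sample`, `detect` and the threshold value are illustrative names and choices.

```python
import math

def nearest_sample(point, samples):
    """Brute-force search for the manifold sample closest to `point`.

    samples: list of (parameters, coordinates) pairs, where coordinates are
    the subspace projections of a normalized feature instance.
    """
    best_params, best_dist = None, math.inf
    for params, coords in samples:
        dist = math.dist(point, coords)
        if dist < best_dist:
            best_params, best_dist = params, dist
    return best_params, best_dist

def detect(point, samples, threshold):
    """Return the estimated parameters, or None if the feature is absent."""
    params, dist = nearest_sample(point, samples)
    return params if dist <= threshold else None

# Hypothetical 1-parameter manifold: a circle sampled every 5 degrees.
samples = [
    ((deg,), (math.cos(math.radians(deg)), math.sin(math.radians(deg))))
    for deg in range(0, 360, 5)
]

near = detect((0.02, 0.99), samples, threshold=0.1)   # close to the 90-degree sample
far = detect((0.1, 0.1), samples, threshold=0.1)      # far from every sample
```

The first query lies within the threshold of the sample at 90 degrees and returns its parameter; the second is far from the manifold, so the feature is declared absent.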

As an example of the search complexity for the
step edge model, if we sample $\theta$ every 1.6°, $\rho$
every 0.088 pixel, and $\sigma$ every 0.14 pixel, we
have 46,368 sample points. Then, in a 10-D
subspace, the complete time to perform
normalization, projection, and search is around 1 ms per
image window on a DEC Alpha 3600. For a 512
× 480 image complete processing takes around 4
minutes. However, by applying rejection
techniques such as [Baker and Nayar, 1996a] the
overall time can be reduced to under 30 seconds.
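
The timing figures follow from simple arithmetic, reproduced here under the assumption that one window is processed per pixel and border effects are ignored.

```python
width, height = 512, 480
ms_per_window = 1.0              # "around 1 ms" per window, from the text

windows = width * height         # one window per pixel (assumption)
total_seconds = windows * ms_per_window / 1000.0
total_minutes = total_seconds / 60.0

# Speed-up that rejection must provide to reach 30 seconds overall.
required_speedup = total_seconds / 30.0
```

This gives 245,760 windows, about 246 seconds or just over 4 minutes, so bringing the total under 30 seconds requires rejecting enough windows early to gain a factor of roughly 8.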

5  Experimental  Results 
5.1  Feature  Detection  Rates 

We  statistically  compare  our  step  edge  detector 
with  the  Canny  [1986]  and  Nalwa-Binford  [1986] 
detectors,  following  the  approach  in  [Nalwa  and 
Binford, 1986]. (See [Baker et al., 1998] for
more  details.)  Since  we  took  great  care  mod¬ 
eling  both  the  features  and  the  imaging  system, 
we  used  our  step  edge  model  to  generate  ideal 
step  edges.  For  fairness,  however,  we  changed 
the  details  slightly.  Both  the  Canny  and  Nalwa- 
Binford  detectors  assume  a  constant  blur/scale, 
so we fixed the value of $\sigma$ in the step edge model
to be 0.6 pixels. Secondly, the Nalwa-Binford
detector is based on a square 5 × 5 window, as
is  Canny  in  the  implementation  that  we  used. 
Hence,  we  changed  the  window  of  our  detector  to 
be  a  square  window  containing  25  pixels,  rather 
than  the  49  pixel  disc  window  used  earlier.  We 
generate  “not  edges”  exactly  as  in  [Nalwa  and 
Binford,  1986],  by  taking  a  constant  intensity 
window, and adding zero-mean Gaussian noise.

In  Figure  2  we  compare  the  detection  perfor¬ 
mance  of  the  three  edge  detectors.  For  each 
pair  of  S.N.R.  and  detector,  we  plot  a  curve  of 
false  positives  against  false  negatives  obtained  by 
varying  the  threshold  inherent  in  each  detection 
algorithm.  The  Canny  operator  thresholds  on 
the  gradient  magnitude,  the  Nalwa-Binford  de¬ 
tector  thresholds  on  the  estimated  step  size,  and 
our  approach  thresholds  on  the  distance  from 
the  parametric  manifold.  The  rate  of  false  posi¬ 
tives  was  estimated  by  applying  each  detector  to 
a  constant  intensity  window  with  noise  added. 
The rate of false negatives was estimated by
applying the detectors to noisy ideal step edges.

Figure 2: A comparison of edge detection rates.

The  Canny  (C),  Nalwa-Binford  (N-B), 
and  parametric  manifold  (PM)  detectors 
are  compared  for  S.N.R.  =  1.0  and  2.0. 
We  plot  false  positives  against  false  neg¬ 
atives.  For  each  detector  and  S.N.R., 
the  result  is  a  curve  parameterized  by 
the  threshold  inherent  in  that  detector. 
The  closer  a  curve  lies  to  the  origin,  the 
better  the  performance.  We  see  that  the 
Canny  detector  and  the  parametric  man¬ 
ifold  technique  perform  comparably. 

The  closer  a  curve  lies  to  the  origin  in  Fig¬ 
ure  2,  the  better  the  performance.  Hence,  we 
can  see  that  both  the  Canny  detector  and  our 
detector do increasingly well as the S.N.R.
increases. The results for the Nalwa-Binford
detector are consistent³ with those described in
[Nalwa  and  Binford,  1986].  Applied  to  real  im¬ 
ages,  the  Nalwa-Binford  detector  does  not  per¬ 
form  as  poorly  as  Figure  2  might  indicate.  The 
poor  Nalwa-Binford  results  are  probably  due  to 
thresholding  on  the  step-size  and  may  well  be 
completely  different  if  we  fix  the  step-size  thresh¬ 
old,  and  vary  the  tanh-fit  threshold. 
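
The curves of Figure 2 are obtained by sweeping a detector's threshold and recording both error rates. The sketch below shows that bookkeeping for a distance-style score, where smaller means more edge-like, as with the distance to the parametric manifold. The score distributions are hypothetical stand-ins generated with a fixed seed, not outputs of any of the three detectors.

```python
import random

def error_rates(edge_scores, noise_scores, threshold):
    """Declare an edge when the score is <= threshold.

    Returns (false_positive_rate, false_negative_rate).
    """
    false_pos = sum(1 for s in noise_scores if s <= threshold) / len(noise_scores)
    false_neg = sum(1 for s in edge_scores if s > threshold) / len(edge_scores)
    return false_pos, false_neg

def tradeoff_curve(edge_scores, noise_scores, thresholds):
    """One (false positive, false negative) point per threshold."""
    return [error_rates(edge_scores, noise_scores, t) for t in thresholds]

# Hypothetical score distributions standing in for real detector outputs.
rng = random.Random(0)
edge_scores = [abs(rng.gauss(0.2, 0.1)) for _ in range(2000)]    # true edges
noise_scores = [abs(rng.gauss(0.8, 0.2)) for _ in range(2000)]   # "not edges"

thresholds = [i / 20 for i in range(0, 31)]
curve = tradeoff_curve(edge_scores, noise_scores, thresholds)
```

Raising the threshold trades false negatives for false positives, and a detector whose curve passes closer to the origin separates the two populations better. For a detector that thresholds on a magnitude, such as gradient strength, the comparison direction would be reversed.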

5.2  Parameter  Estimation  Accuracy 

Again  following  [Nalwa  and  Binford,  1986],  we 
analyze  parameter  estimation  accuracy  by  ran¬ 
domly  generating  a  set  of  feature  parameters, 
synthesizing  a  feature  with  these  parameters, 
adding  noise,  applying  the  detector,  and  then 
measuring the accuracy of the estimated
parameters. In Figure 3, we compare the performance
of  our  step  edge  detector  with  that  of  the  Canny 
detector  [1986]  and  the  Nalwa-Binford  [1986]  de¬ 
tector.  In  the  figure,  we  plot  the  R.M.S.  error 
in the estimate of the orientation $\theta$ against the
S.N.R.  We  see  that  for  low  S.N.R.  the  perfor- 

³We did not use step 2)' of the Nalwa-Binford
algorithm; however, the inclusion of this step does not
radically alter the performance [Nalwa and Binford, 1986].


Figure  3:  A  comparison  of  the  orientation  estima¬ 
tion  accuracy  for  the  three  step  edge  de¬ 
tectors.  We  took  synthesized  step  edges, 
added  noise  to  them,  and  then  applied 
the  edge  detectors.  We  plot  the  R.M.S. 
error  of  the  orientation  estimate  against 
the  S.N.R.  At  all  noise  levels,  the  para¬ 
metric  manifold  approach  slightly  out¬ 
performs  both  the  Nalwa-Binford  and 
Canny  detectors. 

mance  of  all  detectors  is  limited  by  the  noise. 
For  lower  noise  levels,  our  detector  marginally 
out-performs  both  of  the  other  detectors. 
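
When computing an R.M.S. orientation error, angle differences should be wrapped so that, for example, 359° and 1° count as 2° apart rather than 358°. The sketch below shows one way to do this; whether the experiments handled wrap-around in exactly this manner is not stated, and the sample values are hypothetical.

```python
import math

def angular_difference(est_deg, true_deg):
    """Signed difference in degrees, wrapped into [-180, 180)."""
    return (est_deg - true_deg + 180.0) % 360.0 - 180.0

def rms_orientation_error(estimates, truths):
    """Root-mean-square of the wrapped orientation differences."""
    diffs = [angular_difference(e, t) for e, t in zip(estimates, truths)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical estimates; note the wrap-around pair (359 vs 1 degrees).
truths = [10.0, 90.0, 359.0, 180.0]
estimates = [12.0, 88.0, 1.0, 180.0]
rms = rms_orientation_error(estimates, truths)
```

Repeating this over many randomly generated edges at each noise level gives one point per S.N.R. on a plot like Figure 3.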

5.3  Application  to  Images 

In  Figures  4(b)  and  (c)  we  present  the  results 
of  applying  our  step  edge  and  corner  detectors 
to  the  image  in  Figure  4(a).  The  original  image 
is taken from [MOMA, 1984] and was digitized
using  an  Envisions  6600S  scanner  at  200dpi.  We 
present  the  outputs  of  the  detectors  as  grey- 
coded  distance  to  the  feature  manifold  (on  a  non¬ 
linear  scale)  so  that  the  structure  of  the  object 
can  be  seen  clearly.  It  is  immediate  that  the 
features  detected  are  consistent  with  the  orig¬ 
inal  image.  Thresholding  on  the  distance  to 
the  feature  manifold  to  finally  detect  features  is 
straightforward  as  is  demonstrated  in  [Baker  et 
al., 1998] where we superimpose thresholded
feature maps on the original images.

Abstract

Conventional  video  cameras  have  limited  fields  of 
view  that  make  them  restrictive  in  certain  vision 
applications.  A  catadioptric  sensor  uses  a  combi¬ 
nation  of  lenses  and  mirrors  placed  in  a  carefully 
designed  configuration  to  capture  a  much  wider 
field  of  view.  In  particular,  the  shape  of  the  mir¬ 
ror  must  be  selected  to  ensure  that  the  complete 
catadioptric system has a single effective
viewpoint, which is a requirement for the generation
of  pure  perspective  images  from  the  sensed  image. 
In  this  paper,  we  derive  and  analyze  the  complete 
class  of  single-lens  single-mirror  catadioptric  sen¬ 
sors  which  satisfy  the  fixed  viewpoint  constraint. 
Some  solutions  turn  out  to  be  degenerate  with 
no  practical  value  while  other  solutions  lead  to 
realizable  sensors. 

1  Introduction 

Conventional  imaging  systems  are  limited  in  their 
fields of view. An effective way to enhance the
field  of  view  is  to  use  mirrors  in  conjunction  with 
lenses.  This  approach  to  image  formation  is  fast 
gaining  in  popularity  (see  [Nayar,  1988],  [Yagi 
and  Kawato,  1990],  [Hong,  1991],  [Goshtasby  and 
Gruver, 1993], [Yamazawa et al., 1993], [Nalwa,
1996], [Nayar, 1997]). We refer to the general
approach of incorporating mirrors into conventional
imaging systems as catadioptric¹ image
formation. Our recent work in this context has led
to the development of a truly omnidirectional
video camera with a spherical field of view
[Nayar, 1997].

¹Dioptrics is the optics of refracting elements (say,
lenses) whereas catoptrics is the optics of reflecting
surfaces (mirrors). The combination of refracting and
reflecting elements is referred to as catadioptrics [Hecht and
Zajac, 1974].


As recently noted in [Yamazawa et al., 1993],
[Nalwa,  1996]  and  [Nayar,  1997],  it  is  highly  desir¬ 
able  that  the  catadioptric  system  (or  any  imag¬ 
ing  system,  for  that  matter)  have  a  single  cen¬ 
ter  of  projection  (viewpoint).  A  single  viewpoint 
permits  the  creation  of  pure  perspective  images 
from  the  image  sensed  by  the  catadioptric  sys¬ 
tem.  This  is  done  by  mapping  sensed  brightness 
values  onto  a  plane  placed  at  any  distance  (effec¬ 
tive  focal  length)  from  the  viewpoint.  Any  image 
computed  in  this  manner  preserves  linear  per¬ 
spective  geometry.  For  instance,  straight  lines  in 
the  scene  produce  straight  lines  in  the  computed 
image.  Images  that  adhere  to  perspective  projec¬ 
tion  are  desirable  from  two  standpoints;  they  are 
consistent  with  the  way  we  are  used  to  seeing  im¬ 
ages,  and  they  lend  themselves  to  further  process¬ 
ing  by  the  large  body  of  work  in  computational 
vision  that  assumes  linear  perspective  projection. 
When  the  catadioptric  system  is  omnidirectional 
in  its  field  of  view,  the  single  viewpoint  permits 
the  construction  of  not  only  perspective  but  also 
panoramic  images. 

In  this  paper,  we  derive  the  complete  set  of 
catadioptric  systems  with  a  single  effective  view¬ 
point  and  which  are  constructed  from  a  single 
conventional  lens  and  a  single  mirror.  As  we  will 
show,  the  class  of  mirrors  which  can  be  used  is  ex¬ 
actly  the  class  of  rotated  (swept)  conic  sections. 
Within  this  class  of  solutions,  several  swept  con¬ 
ics  prove  to  be  degenerate  solutions  and  hence  im¬ 
practical,  while  others  lead  to  realizable  sensors. 
During our analysis we will stop at many points
to  evaluate  the  merits  of  the  solutions  as  well  as 
the  merits  of  catadioptric  sensors  proposed  in  the 
literature. 

2  General  Solution 

Let  the  final  (dioptric)  stage  of  our  sensor  be  a 
conventional  perspective  lens.  In  Figure  1,  the  ef¬ 
fective  pinhole  of  the  lens  is  p.  We  formulate  the 
catadioptric image formation problem as follows:
Find the class of reflecting surfaces that, when
used in conjunction with a perspective lens, produce
an image of the world as seen from a fixed
viewpoint. Let us assume that the fixed viewpoint
v is at the origin of the coordinate frame
(see Figure 1) and the center p of the perspective
lens is located on the vertical axis at a distance
c from v.

For the fixed viewpoint constraint to hold, each
world point seen from v must be reflected by a
point on the mirror surface S(x, y) towards p.
Note that since perspective projection is rotationally
symmetric about the optical axis z, the
mirror can be assumed to be a surface of revolution
around z. Therefore, it suffices to find
the one-dimensional profile z(r) = S(x, y), where
r = \sqrt{x^2 + y^2}.

In fact, the viewpoint v and the determined
profile z(r) are in no way restricted to the optical
axis. Since perspective projection is rotationally
symmetric with respect to any ray that
passes through the pinhole p, the viewpoint and
the profile could be moved from the optical axis
by keeping the distance c the same and aligning
the symmetry axis of the profile with the ray that
passes through the viewpoint and the pinhole. Of
course, in this case, the image plane shown in Figure 1
would be non-frontal. This does not pose
any additional ambiguity as the mapping from
any non-frontal image plane to a frontal image
plane is one-to-one.

Figure 1: Geometry used to derive the reflecting surface
that produces an image of the world as seen from
a fixed viewpoint v. This image is captured using
a conventional perspective camera with an effective
pinhole p.

With the above generalizations in place, we
are ready to derive the profile of the reflecting
surface. The relation between the angle θ of the
incoming ray and the reflecting surface is

\[ \tan\theta = \frac{z}{r} \qquad (1) \]

The angle α made by the reflected ray with the
horizontal axis is given by

\[ \tan\alpha = \frac{c - z}{r} \qquad (2) \]

Let the surface slope at the point of reflection
be defined by the angle β made by the surface
normal with the vertical axis:

\[ \tan\beta = -\frac{dz}{dr} \qquad (3) \]

This allows us to write

\[ \tan 2\beta = \frac{2\tan\beta}{1 - \tan^2\beta} = \frac{-2\,\frac{dz}{dr}}{1 - \left(\frac{dz}{dr}\right)^2} \qquad (4) \]

Since the surface is specular, the angles of incidence
and reflection are equal. Consequently,

\[ 2\beta = \alpha - \theta \qquad (5) \]

which gives us

\[ \tan 2\beta = \frac{\tan\alpha - \tan\theta}{1 + \tan\alpha\,\tan\theta} \qquad (6) \]

In the right hand side of the above expression, we
substitute (1) and (2) and equate with the right
hand side of (4) to get

\[ \frac{-2\,\frac{dz}{dr}}{1 - \left(\frac{dz}{dr}\right)^2} = \frac{(c - 2z)\,r}{r^2 + cz - z^2} \qquad (7) \]

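Thus, we find that the reflecting surface must
satisfy the above quadratic first-order differential
equation. It is straightforward to solve the
quadratic for surface slope:

\[ \frac{dz}{dr} = \frac{(z^2 - r^2 - cz) \pm \sqrt{r^2 c^2 + (z^2 + r^2 - cz)^2}}{r\,(2z - c)} \qquad (8) \]

Next, we substitute y = z − c/2 and set b = c/2
which yields

\[ \frac{dy}{dr} = \frac{(y^2 - r^2 - b^2) \pm \sqrt{4 r^2 b^2 + (y^2 + r^2 - b^2)^2}}{2ry} \qquad (9) \]

Then, substituting 2rx = y² + r² − b², we get

\[ \frac{1}{\sqrt{b^2 + x^2}}\,\frac{dx}{dr} = \pm\frac{1}{r} \qquad (10) \]

Integrating both sides results in

\[ \ln\left(x + \sqrt{b^2 + x^2}\right) = \pm\ln r + C \qquad (11) \]

where C is the constant of integration. Hence,

\[ x + \sqrt{b^2 + x^2} = \frac{k}{2}\,r^{\pm 1} \qquad (12) \]

where k = 2e^C > 0 is a constant. By back substituting
and simplifying we arrive at two equations
which comprise the general solution:

\[ \left(z - \frac{c}{2}\right)^2 - r^2\left(\frac{k}{2} - 1\right) = \frac{c^2}{4}\left(\frac{k - 2}{k}\right), \quad k \ge 2 \qquad (13) \]

\[ \left(z - \frac{c}{2}\right)^2 + r^2\left(1 + \frac{c^2}{2k}\right) = \frac{2k + c^2}{4}, \quad k > 0 \qquad (14) \]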


Together,  these  two  expressions  represent  the  en¬ 
tire  class  of  mirrors  that  satisfy  the  fixed  view¬ 
point  constraint.  Again,  since  perspective  pro¬ 
jection  is  symmetric  about  any  ray  that  passes 
through  the  pinhole,  the  viewpoint  v  and  the  cor¬ 
responding  mirror  are  in  no  way  restricted  to  the 
optical  axis. 
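
The fixed viewpoint property of the general solution can be checked numerically. The Python sketch below is only an illustration under stated assumptions: it takes the two solution branches in the forms (z − c/2)² − r²(k/2 − 1) = (c²/4)(k − 2)/k and (z − c/2)² + r²(1 + c²/(2k)) = (2k + c²)/4, which is what back-substituting (12) gives, and it works in the (r, z) plane with v at the origin and p at (0, c). The function names and the sample values of c, k and r are hypothetical choices made for this example.

```python
import math


def mirror_point(c, k, r, family):
    """Mirror height z(r) and the gradient of the implicit profile F(r, z).

    family == "hyper": (z - c/2)^2 - r^2 (k/2 - 1)      = (c^2/4)(k - 2)/k,  k > 2
    family == "ellip": (z - c/2)^2 + r^2 (1 + c^2/(2k)) = (2k + c^2)/4,      k > 0
    """
    if family == "hyper":
        coeff = -(k / 2.0 - 1.0)
        rhs = (c * c / 4.0) * (k - 2.0) / k
    else:
        coeff = 1.0 + c * c / (2.0 * k)
        rhs = (2.0 * k + c * c) / 4.0
    y = -math.sqrt(rhs - coeff * r * r)  # sheet on the viewpoint side, y = z - c/2
    return y + c / 2.0, (2.0 * coeff * r, 2.0 * y)


def viewpoint_residual(c, k, r, family):
    """How far the reflected ray misses the direction of the pinhole (0 = exact)."""
    z, (gr, gz) = mirror_point(c, k, r, family)
    g = math.hypot(gr, gz)
    nr, nz = gr / g, gz / g                  # unit surface normal
    t = math.hypot(r, z)
    dr, dz = r / t, z / t                    # line through v and the mirror point
    dot = dr * nr + dz * nz
    rr, rz = dr - 2.0 * dot * nr, dz - 2.0 * dot * nz  # specular reflection
    pr, pz = -r, c - z                       # from the mirror point to p
    # 2-D cross product: zero when the reflected line passes through p
    return abs(rr * pz - rz * pr) / math.hypot(pr, pz)


if __name__ == "__main__":
    for r in (0.1, 0.5, 1.0, 3.0):
        print("hyperboloid", r, viewpoint_residual(2.0, 4.0, r, "hyper"))
    for r in (0.1, 0.5, 0.9):
        print("ellipsoid  ", r, viewpoint_residual(2.0, 2.0, r, "ellip"))
```

The residual is a cross product, so it tests only that the reflected line passes through the pinhole; it does not distinguish the convex (hyperboloidal) from the concave (ellipsoidal) use of the surface.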

3  Specific  Mirrors 

A  quick  glance  at  the  forms  of  equations  (13) 
and  (14)  reveals  that  the  mirror  profiles  are  conic 
sections.  However,  each  conic  section  must  be 
placed  at  a  specific  distance  from  the  pinhole. 
As we shall see, though all our conic sections are
theoretically  valid,  many  prove  to  be  impracti¬ 
cal  and  only  a  few  lead  to  useful  solutions.  We 
are  now  in  a  position  to  evaluate  several  specific 
cases. 

3.1  Planes 

In solution (13), if we set k = 2, we get the cross-section
of a planar mirror:

\[ z = \frac{c}{2} \qquad (15) \]

As shown in Figure 2, the plane bisects the
line segment joining the pinhole and the viewpoint.
This result is easily generalized to arbitrary
planes or viewpoints. For any plane with
unit normal n and any point q on it, the viewpoint
is simply the reflection of the pinhole

\[ v = p - 2\,((p - q) \cdot n)\,n \qquad (16) \]

Equivalently, for any desired viewpoint, points x
on the planar mirror are given by

\[ \left(x - \frac{p + v}{2}\right) \cdot (p - v) = 0 \qquad (17) \]

These  expressions  lead  us  to  a  simple  but  unfor¬ 
tunate  theorem:  For  a  single  fixed  pinhole,  no  two 
planar  mirrors  can  share  the  same  viewpoint,  and 
equivalently,  two  different  viewpoints  cannot  be 
generated  by  the  same  planar  mirror.  It  is  clear 
from  Figure  2  that  a  single  planar  mirror  does  not 
enhance  the  field  of  view  of  the  imaging  system. 
At  the  same  time,  the  above  theorem  makes  it 
impossible  to  increase  the  field  of  view  by  pack¬ 
ing  a  large  number  of  planar  mirrors  (pointing 
in  different  directions)  in  front  of  a  conventional 
imaging  system.  On  the  brighter  side,  the  two 
views  of  a  scene  needed  for  stereo  can  be  cap¬ 
tured  by  a  single  lens  and  two  planar  mirrors,  as 
shown  in  [Goshtasby  and  Gruver,  1993]. 
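
The planar case is simple enough to work through directly. The following Python sketch evaluates the reflection of the pinhole across a mirror plane, v = p − 2((p − q)·n)n, and the bisecting-plane condition (x − (p + v)/2)·(p − v) = 0. The function names, the tuple representation of points and the sample planes are assumptions made for illustration only.

```python
def reflect_pinhole(p, q, n):
    """Viewpoint v = p - 2((p - q) . n) n for a plane through q with unit normal n."""
    d = sum((pi - qi) * ni for pi, qi, ni in zip(p, q, n))
    return tuple(pi - 2.0 * d * ni for pi, ni in zip(p, n))


def on_bisecting_plane(x, p, v, tol=1e-12):
    """True if x satisfies (x - (p + v)/2) . (p - v) = 0 up to a tolerance."""
    s = sum((xi - 0.5 * (pi + vi)) * (pi - vi) for xi, pi, vi in zip(x, p, v))
    return abs(s) <= tol


if __name__ == "__main__":
    c = 2.0
    p = (0.0, 0.0, c)                       # pinhole on the vertical axis
    q = (0.0, 0.0, c / 2.0)                 # a point shared by both sample planes
    v1 = reflect_pinhole(p, q, (0.0, 0.0, 1.0))   # the plane z = c/2
    v2 = reflect_pinhole(p, q, (0.6, 0.0, 0.8))   # a tilted plane through q
    print(v1, v2)   # two different planes give two different viewpoints
```

With the horizontal plane z = c/2 the viewpoint falls at the origin, as in the cross-section above; tilting the plane moves the viewpoint, which is the content of the theorem.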


Figure  2:  A  planar  mirror  must  bisect  the  segment 
joining  the  pinhole  and  the  desired  viewpoint.  Since 
two  planar  mirrors  cannot  generate  the  same  view¬ 
point,  multiple  planar  mirrors  cannot  be  used  to  en¬ 
hance  the  field  of  view  of  a  conventional  imaging  sys¬ 
tem. 

To ensure a single viewpoint while using multiple
planar mirrors, Nalwa [Nalwa, 1996] has arrived
at a clever design that includes four planar
mirrors that form the faces of a pyramid.
Four  separate  imaging  systems  are  used,  each  one 
placed  above  one  of  the  faces  of  the  pyramid.  The 
optical  axes  of  the  imaging  systems  and  the  an¬ 
gles  made  by  the  four  planar  faces  are  adjusted 
so  that  the  four  viewpoints  produced  by  the  pla¬ 
nar  mirrors  coincide.  The  result  is  a  sensor  that 
has  a  single  viewpoint  and  a  panoramic  field  of 
view of approximately 360° × 50°. The panoramic
image  is  of  relatively  high  resolution  as  it  is  a  con¬ 
catenation  of  four  images  provided  by  four  non¬ 
overlapping  imaging  systems.  Nalwa’s  sensor  is 
easy  enough  to  implement  but  requires  the  use 
of  four  of  each  component  (cameras,  lenses,  and 
digitizers). 

3.2  Cones 

In solution (13), if we set c = 0, the result is a
conical mirror with cross-section

\[ z = \sqrt{\frac{k - 2}{2}}\; r, \quad k \ge 2 \qquad (18) \]

The  angle  at  the  apex  of  the  cone  varies  with  k. 
At  first  glance,  this  may  seem  like  a  reasonable 
solution.  However,  since  c  =  0,  the  apex  of  the 
cone  must  be  at  the  pinhole.  This  implies  that 
the  rays  of  light  entering  the  pinhole  can  only 
graze  the  cone  and  do  not  represent  reflections 
of  the  world  (see  Figure  3).  Hence,  we  have  a 
degenerate  solution  that  is  of  no  practical  value. 

Indeed,  the  cone  has  been  used  for  wide-angle 
imaging,  in  particular,  for  autonomous  naviga¬ 
tion  [Yagi  and  Kawato,  1990].  In  these  imple¬ 
mentations,  the  apex  of  the  cone  was  placed  at 
a  distance  from  the  pinhole.  In  such  cases,  it  is 
easy  to  show  that  the  viewpoint  is  no  longer  a 
single  point  but  rather  a  locus  [Nalwa,  1996].  If 
the  axis  of  the  cone  points  in  the  direction  of  the 
pinhole,  the  locus  is  a  circle  that  hangs  like  a  halo 
around  the  cone. 

3.3  Spheres 

In solution (14), if we set c = 0, we get a spherical
mirror with cross-section

\[ z^2 + r^2 = \frac{k}{2}, \quad k > 0 \qquad (19) \]

Like the cone, this proves to be a solution of little
value; since the viewpoint and pinhole must
coincide, the observer sees itself and nothing else,
as shown in Figure 4.

Figure 3: The conical mirror has its apex at the pinhole.
This degenerate solution is of little practical
value. If the apex is moved away from the pinhole,
the viewpoint is no longer single but rather lies on a
circular locus.

Figure 4: A spherical mirror produces a single viewpoint
only when the pinhole lies at its center. This,
again, is a solution of little use as the observer sees
itself and nothing else.

In  [Hong,  1991],  a  wide-angle  implementation 
with  a  sphere  is  described  that  was  used  for  land¬ 
mark  navigation.  In  this  case,  the  sphere  was 
placed  at  a  distance  from  the  effective  pinhole  of 
the  camera.  As  with  the  cone,  the  result  is  a  lo¬ 
cus  of  viewpoints  rather  than  a  single  viewpoint. 
The  locus  in  the  case  of  the  sphere  can  turn  out 
to  be  a  surface  of  large  extent,  depending  on  the 
distance  between  the  center  of  the  sphere  and  the 
pinhole.  In  [Nayar,  1988],  a  stereo  system  is  pro¬ 
posed  that  uses  a  single  image  of  two  specular 
spheres to compute depth. In this case, the single
viewpoint constraint is not critical as stereo
requires multiple viewpoints.

3.4  Ellipsoids 

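In solution (14), when c > 0 and k > 0, we get
an ellipsoid with cross-section

\[ \frac{1}{a_e^2}\left(z - \frac{c}{2}\right)^2 + \frac{1}{b_e^2}\, r^2 = 1 \qquad (20) \]

where

\[ a_e = \sqrt{\frac{2k + c^2}{4}}, \quad b_e = \sqrt{\frac{k}{2}} \qquad (21) \]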


We  have  now  arrived  at  a  solution  that  can  be 
used  to  enhance  the  field  of  view.  As  shown  in 
Figure  5,  the  viewpoint  v  and  pinhole  p  are  lo¬ 
cated  at  the  two  foci  of  the  ellipse,  respectively. 
If,  for  instance,  the  section  of  the  ellipse  that  lies 
beneath  the  viewpoint  is  used,  the  effective  field 
of  view  (ignoring  self-occlusion  by  the  lens)  cor¬ 
responds  to  the  upper  hemisphere.  It  is  easy  to 
see that terminating the ellipse below the viewpoint
does not enhance the field of view. Unfortunately,
extending the mirror above the viewpoint
does not help either; in this case, rays of
light  entering  the  pinhole  would  have  undergone 
multiple  reflections  by  the  mirror.  Yet,  the  el¬ 
lipse  does  represent  our  first  useful  solution.  It  is 
similar  in  nature  to  our  next  solution,  the  hyper¬ 
boloid.  Hence,  we  shall  defer  our  discussion  on 
implementation  issues  related  to  the  ellipsoid. 


3.5  Hyperboloids 

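In solution (13), when c > 0 and k > 2, we get a
hyperboloid² with cross-section

\[ \frac{1}{a_h^2}\left(z - \frac{c}{2}\right)^2 - \frac{1}{b_h^2}\, r^2 = 1 \qquad (22) \]

where

\[ a_h = \frac{c}{2}\sqrt{\frac{k - 2}{k}}, \quad b_h = \frac{c}{2}\sqrt{\frac{2}{k}} \qquad (23) \]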


Figure 5: An ellipsoidal mirror is a viable solution
when the pinhole and the viewpoint are located at its
two foci, respectively. If the ellipsoid is terminated
by a plane passing through the viewpoint, the field of
view corresponds to a hemisphere.


As we see in Figure 6, in the limit k → 2, the
hyperboloid flattens to yield the planar solution
of section 3.1. As k increases, the curvature of the
hyperboloid increases, and hence also the field of
view of the catadioptric system. The two foci of
the hyperboloid remain fixed, one at the pinhole
p and the other at the viewpoint v.

Figure 6: The hyperboloidal mirror can produce the
desired increase in field of view. The pinhole and the
viewpoint are located at the two hyperboloidal foci.
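
The claim that the foci stay fixed as k varies can be checked with a few lines of arithmetic. The Python sketch below assumes the semi-axes that follow from dividing the general solution through by its right-hand side: a_h = (c/2)√((k − 2)/k) and b_h = (c/2)√(2/k) for the hyperboloid, and a_e = √((2k + c²)/4) and b_e = √(k/2) for the ellipsoid. These closed forms and the function names are our own working for this illustration.

```python
import math


def hyperboloid_axes(c, k):
    """Semi-axes (a_h, b_h) of the hyperbolic cross-section; requires c > 0, k > 2."""
    if c <= 0 or k <= 2:
        raise ValueError("need c > 0 and k > 2")
    return (c / 2.0) * math.sqrt((k - 2.0) / k), (c / 2.0) * math.sqrt(2.0 / k)


def ellipsoid_axes(c, k):
    """Semi-axes (a_e, b_e) of the elliptic cross-section; requires c > 0, k > 0."""
    if c <= 0 or k <= 0:
        raise ValueError("need c > 0 and k > 0")
    return math.sqrt((2.0 * k + c * c) / 4.0), math.sqrt(k / 2.0)


if __name__ == "__main__":
    c = 2.0
    for k in (2.001, 3.0, 10.0, 100.0):
        a, b = hyperboloid_axes(c, k)
        # centre-to-focus distance sqrt(a^2 + b^2) should stay at c/2;
        # the vertex height c/2 - a approaches c/2 (the plane) as k -> 2
        print(k, math.hypot(a, b), c / 2.0 - a)
```

For both conics the centre lies at z = c/2 and the centre-to-focus distance evaluates to c/2, so the foci sit at z = 0 (the viewpoint) and z = c (the pinhole) for every admissible k.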

This solution provides a practical approach
to wide-angle imaging. Yamazawa et al.
[Yamazawa et al., 1993] recognized that the
hyperboloid, if chosen and positioned carefully, would
produce  a  single  viewpoint.  They  implemented 
a  sensor  for  autonomous  navigation  and  demon¬ 
strated  the  construction  of  perspective  images 
from  hyperboloidal  ones. 

While  this  solution  is  both  interesting  and  fea¬ 
sible,  it  must  be  implemented  with  care.  As  can 
be  seen  from  Figure  6,  for  any  chosen  value  of 
k,  if  the  viewpoint  is  distant  from  the  pinhole, 
the  mirror  must  be  large.  As  the  viewpoint  ap¬ 
proaches  the  pinhole,  the  mirror  reduces  in  size 
but  the  curvatures  at  all  points  on  the  mirror  in¬ 
crease.  This  increases  the  optical  effects  of  coma 
and  astigmatism  that  are  known  to  produce  blur¬ 
ring  [Hecht  and  Zajac,  1974].  Furthermore,  it 
is  hard  to  configure  an  imaging  system  with  a 
large  enough  depth  of  field  close  to  the  pinhole 
that  would  allow  the  hyperboloidal  mirror  to  be 
placed  in  close  proximity.  These  trade-offs  imply 
that  the  distance  of  the  viewpoint  must  be  cho¬ 
sen  with  care.  Also,  the  axis  of  the  hyperboloid 
must  pass  through  the  pinhole.  For  these  reasons, 
careful  implementation  is  required  to  achieve  the 
desired  optical  properties  and  precise  calibration 
is needed to establish the mapping between an incoming
principal ray and its image coordinates.
It is easy to see that all of the above implementational
issues also apply to the ellipsoidal solution.

3.6  Paraboloids 

If  image  projection  is  orthographic  rather  than 
perspective,  the  geometrical  mappings  between 
the  image,  the  mirror  and  the  world  are  invari¬ 
ant  to  translations  of  the  mirror  with  respect  to 
the  imaging  system.  Consequently,  both  calibra¬ 
tion  as  well  as  the  computation  of  perspective 
images  are  greatly  simplified.  There  are  simple 
ways  to  achieve  pure  orthographic  projection,  as 
described  in  [Nayar,  1997]. 

The shape of the mirror in this case can be derived
by assuming orthographic projection rather
than perspective projection in the dioptric stage
of image formation. This derivation is given in
[Nayar, 1997] and the mirror is shown to be
paraboloidal (see Figure 7). Paraboloidal mirrors
are frequently used to converge an incoming
set of parallel rays at a single point (the focus),
or to generate a collimated light source from a
point source (placed at the focus). In both these
cases, the paraboloid is a concave mirror that is
reflective on its inner surface. In our case, the
paraboloid is reflective on its outer surface (convex
mirror); all incoming principal rays are
orthographically reflected by the mirror. Further,
the incoming rays can be extended to intersect at
the focus of the paraboloid, which serves as the
viewpoint.

Figure 7: For orthographic projection, the solution is
a paraboloid with the viewpoint located at the focus.
Orthographic projection makes the geometric mappings
between the image, the paraboloidal mirror and
the world invariant to translations of the mirror. This
greatly simplifies calibration and the computation of
perspective images from paraboloidal ones.

²Note that k < 2 is not possible in solution (13) since
this yields an imaginary surface.

Alternatively,  the  same  solution  can  be  derived 
from  our  general  solution  for  catadioptric  imag¬ 
ing.  We  know  that  orthographic  projection  is  a 
limiting  case  of  perspective,  where  the  distance 
between  the  pinhole  and  viewpoint  approaches 
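infinity. Equation (13) can be rewritten as:

\[ \frac{z^2}{k} - \frac{zc}{k} - r^2\left(\frac{1}{2} - \frac{1}{k}\right) = -\frac{c^2}{2k^2} \qquad (24) \]

Then, in the limit c → ∞, k → ∞, while keeping
c/k = h a constant, we have

\[ z = \frac{h^2 - r^2}{2h} \qquad (25) \]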

The  parameter  h  of  the  paraboloid  is  its  radius  at 
z  =  0.  The  distance  between  the  vertex  and  the 
focus is h/2. Therefore, h determines the size of
the  paraboloid  that,  for  any  given  orthographic 
lens  system,  can  be  chosen  to  maximize  resolu¬ 
tion.  If  the  paraboloid  is  terminated  at  its  focus, 
the  imaging  system  yields  a  hemispherical  field 
of  view.  As  shown  in  [Nayar,  1997],  two  such 
imaging  systems  can  be  placed  back-to-back  to 
achieve  a  spherical  field  of  view. 
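
To make the paraboloidal geometry concrete, the Python sketch below maps a viewing direction from the viewpoint (the focus, at the origin) to orthographic image coordinates. It assumes the mirror profile z = (h² − r²)/(2h); on that surface the distance from the focus satisfies t = h − z, so a unit direction u with vertical component u_z meets the mirror at t = h/(1 + u_z). The function names are hypothetical, and pixel scaling and image-centre offsets are deliberately left out.

```python
import math


def paraboloid_hit(direction, h):
    """Mirror point and its distance t from the focus for a viewing direction.

    Assumes the mirror z = (h^2 - r^2) / (2h) with its focus at the origin.
    """
    ux, uy, uz = direction
    n = math.sqrt(ux * ux + uy * uy + uz * uz)
    if n == 0.0:
        raise ValueError("direction must be non-zero")
    ux, uy, uz = ux / n, uy / n, uz / n
    if uz <= -1.0 + 1e-12:
        raise ValueError("a ray along -z never meets the mirror")
    t = h / (1.0 + uz)              # from t = h - z with z = t * uz
    return (t * ux, t * uy, t * uz), t


def image_coordinates(direction, h):
    """Orthographic projection along z: drop the z coordinate of the mirror point."""
    (x, y, _z), _t = paraboloid_hit(direction, h)
    return x, y


if __name__ == "__main__":
    h = 2.0
    print(image_coordinates((0.0, 0.0, 1.0), h))   # vertex maps to the image centre
    print(image_coordinates((1.0, 0.0, 0.0), h))   # horizon maps to radius h
```

A perspective or panoramic image can then be assembled by evaluating this mapping for the direction through each output pixel and sampling the sensed image at the returned coordinates.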

4  Resolution 

Here,  we  define  resolution  as  the  solid  angle  sub¬ 
tended  from  the  viewpoint  by  a  pixel  in  the 
sensed  image.  Let  us  assume  that  the  area  pro¬ 
jected  by  a  pixel  along  its  line  of  sight  is  da,  as 
shown  in  Figure  8.  Note  that  for  orthographic 
projection  da  is  a  constant,  while  for  perspective 
projection  it  is  easily  computed  from  the  distance 
of  the  corresponding  point  on  the  mirror  and  the 
focal  length  of  the  imaging  system.  Given  that 
all  reflections  are  specular,  the  reflecting  surface 
area occupied by da is ds = da / cos φ. The
foreshortened surface area as seen by the viewpoint
v is (da / cos φ) cos φ = da. The solid angle
subtended by the reflecting surface element is
dω = da / t², where t is the distance of the surface
element from the viewpoint. Hence, the spatial
resolution for any catadioptric sensor can be written
as

\[ \frac{da}{d\omega} = t^2 \qquad (26) \]

For  instance,  in  the  case  of  a  paraboloidal  mir¬ 
ror  [Nayar,  1997],  the  resolution  increases  by  a 
factor  of  4  from  the  vertex  (t  =  h/2)  of  the 
paraboloid  to  the  fringe  (t  =  h).  With  (26),  it 
is  easy  to  see  that  a  variation  in  spatial  resolu¬ 
tion  occurs  not  only  in  the  case  of  curved  mirrors 
but  also  planar  ones.  In  principle,  it  is  of  course 
possible  to  use  image  detectors  with  non-uniform 
resolution  to  compensate  for  the  above  variation. 
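
The factor of 4 quoted for the paraboloid follows directly from the relation da/dω = t². The short Python sketch below evaluates it as a function of image radius, assuming the mirror profile z = (h² − ρ²)/(2h) with the viewpoint at the origin; the function name and the use of image radius ρ as the parameter are choices made for this illustration.

```python
import math


def paraboloid_resolution(rho, h):
    """da/d(omega) = t^2 at image radius rho, for the mirror z = (h^2 - rho^2)/(2h)."""
    if rho < 0 or h <= 0:
        raise ValueError("need rho >= 0 and h > 0")
    z = (h * h - rho * rho) / (2.0 * h)
    t = math.hypot(rho, z)          # distance of the mirror point from the viewpoint
    return t * t


if __name__ == "__main__":
    h = 2.0
    vertex = paraboloid_resolution(0.0, h)   # t = h/2 at the vertex
    fringe = paraboloid_resolution(h, h)     # t = h at the fringe (z = 0)
    print(vertex, fringe, fringe / vertex)
```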
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Abstract 

This  paper  introduces  two  algorithms  for  enhanc¬ 
ing  image  resolution  from  an  image  sequence.  The 
“image-based”  approach  presumes  that  the  images 
were  taken  under  the  same  illumination  conditions 
and  uses  the  intensity  information  provided  by  the 
image  sequence  to  construct  the  high-resolution  im¬ 
age.  When  imaging  from  different  viewpoints,  over 
long  temporal  spans,  or  imaging  scenes  with  moving 
3D  objects,  the  image  intensities  naturally  vary.  The 
“edge-based”  approach,  based  on  edge/blur  mod¬ 
els,  allows  super-resolution  under  lighting  variations. 
The  paper  presents  the  theory  and  the  experimental 
results  using  these  two  algorithms. 

1  Introduction 

The idea of super-resolution, combining pieces from
an image sequence into a single image with higher
resolution than any of the individual images, has
been around for years. Previous research on
super-resolution [Huang and Tsai-1984, Gross-1986,
Peleg et al.-1987, Keren et al.-1988, Irani and
Peleg-1991, Irani and Peleg-1993, Bascle et al.-1996]
ignores the impact of image warping
techniques.  It  also  presumes  that  the  images  were 
taken  under  the  same  illumination  conditions.  This 
paper  summarizes  our  recent  work  which  addresses 
techniques  to  improve  the  quality  of  super-resolution 
imaging  and  to  deal  with  lighting  variations.  We 
show  that  image  warping  techniques  may  have  a 
strong  impact  on  the  quality  of  image  resolution  en¬ 
hancement. 

Image warping requires the underlying image to be
“resampled” at non-integer locations; it requires
spatially varying image reconstruction. When the goal
of warping is to produce output for human viewing,
only mildly accurate image intensities are needed. In
these cases, techniques using bi-linear interpolation
have been found sufficient. However, as a step for
applications such as super-resolution, the precision
of the warped intensity values is often important.
For these problems, bi-linear image reconstruction
may not be sufficient; the spatially varying nature
of the reconstruction limits the “efficient” alternative
reconstruction methods. This paper shows how ideas
of imaging-consistent reconstruction/restoration
algorithms [Boult and Wolberg-1993] and the integrating
resampler [Chiang and Boult-1996b] can be used
for warping while maintaining superior image quality.

*This work is supported in part by NSF PYI IRI-90-57951, NSF
[...] supported parts of this research.

2  Image-Based  Super-Resolution 

The idea of super-resolution is based on the fact that
each image in the sequence provides a small amount
of additional information. There are, of course, some
fundamental limits on what this combination can do.
If the images were noise-free, focused and Nyquist
sampled, then multiple images would add nothing.
However, if the images are blurred, noisy, and
aliased, deblurring is unstable. If time is not a
concern, then standard DSP techniques can address
these problems, formulating fusion as millions of
coupled equations. The goal is then to come up with
an efficient approximation.

Our  approach  recognizes  four  separate  components: 
the  matching  (to  determine  alignment),  the  warping 
(to  align  the  data  and  increase  sampling  rate),  the  fu¬ 
sion  (to  produce  a  less  noisy  image),  and  an  optional 
deblurring  stage  to  remove  lens  blur.  For  now,  we  are 
using  traditional  matching  on  image  fields  (normal¬ 
ized  SSD  or  correlation)  and  traditional  deblurring. 
We are concentrating our efforts on warping and
fusion. Warping is considered in the next section.

Figure 1: Original images and down-sampled versions
of super-resolution results. (a) shows one of
the eight original images; (b) shows the down-sampled
super-resolution using bi-linear resampling;
(c) shows down-sampled super-resolution using QRS;
(d) shows a deblurred original (i.e., deblurred (a));
(e) shows down-sampled super-resolution by
back-projection; (f) shows super-resolution with QRS
followed by deblurring followed by down-sampling.

For  fusion,  we  have  experimented  with  simple  av¬ 
eraging,  averaging  with  trimmed  tails,  and  median. 
These  produce  decreasingly  accurate  estimates  with 
increasing robustness to outliers. As the matching
is sometimes inaccurate and because of aliasing
artifacts, a few outliers are common, so averaging with
trimmed tails is probably the best overall technique.
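
The three fusion rules can be sketched per pixel as follows. This Python example is a minimal illustration, not the implementation used in the experiments: it assumes the warped images are already aligned and stored as equally sized lists of rows, and the function names and the number of samples trimmed from each tail (the `trim` parameter) are hypothetical.

```python
from statistics import mean, median


def trimmed_mean(values, trim=1):
    """Average after discarding the `trim` smallest and `trim` largest samples."""
    ordered = sorted(values)
    if len(ordered) <= 2 * trim:
        raise ValueError("not enough samples to trim")
    return mean(ordered[trim:len(ordered) - trim])


def fuse(stack, method="trimmed", trim=1):
    """Fuse aligned images (each a list of rows of floats) pixel by pixel."""
    rules = {
        "mean": mean,
        "median": median,
        "trimmed": lambda samples: trimmed_mean(samples, trim),
    }
    rule = rules[method]
    rows, cols = len(stack[0]), len(stack[0][0])
    return [[rule([img[i][j] for img in stack]) for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    # eight one-pixel "images"; one of them is a matching outlier
    stack = [[[10.0]] for _ in range(7)] + [[[250.0]]]
    for method in ("mean", "trimmed", "median"):
        print(method, fuse(stack, method))
```

In this toy stack the plain average is pulled toward the outlier while the trimmed average and the median are not, which is the robustness trade-off described above.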

In [Chiang and Boult-1996a], we presented initial
results and compared our technique to the leading
existing work of [Irani and Peleg-1993] (which is
referred to as back-projection in the following). Figs. 1,
2,  and  3  show  some  example  results.  In  all  cases,  the 
resulting  super-resolution  images  are  a  scale-up  by  a 
factor  of  4.  We  note  that  previous  work  on  this  topic 
reported  results  only  scaling  by  a  factor  of  2. 

If we down-sample our super-resolution estimate,
we should get an increase in image quality. A
few examples of this are shown in Fig. 1. It can
be easily seen from Fig. 1 that image warping
techniques indeed have a strong impact on the quality
enhancement, even if the image resolution is not
increased. In particular, Fig. 1f is significantly clearer
than the original (Fig. 1a) or a deblurred version
thereof (Fig. 1d). Thus, super-resolution provides
added benefits even if the final sampling rate is
exactly the same as the original.

Fig. 2 shows the final results of our first experiment.
Fig. 2a shows Fig. 1a blown up by a factor
of 4 using pixel replication. Fig. 2b shows
super-resolution by our implementation of Irani’s
back-projection method using bi-linear resampling to
simulate the image formation process. Fig. 2c shows
super-resolution using QRS followed by deblurring.
Fig. 3 shows the effects of deblurring and of using
bi-linear reconstruction for warping.

Figure 2: (a) Fig. 1a pixel-replicated by a factor of 4;
(b) super-resolution by back-projection; (c)
super-resolution using QRS with deblurring.

Figure 3: Effects of deblurring and reconstruction
kernel. (a) shows Fig. 2(c) without deblurring; (b)
shows super-resolution using bi-linear resampling
with deblurring; and (c) shows (b) before deblurring.

Fig.  4  shows  our  second  example — a  more  complex 
gray  level  image.  The  tread-wheels  of  the  toy  tank 
are  visible  in  the  super-resolution  image  but  not  in 
the  originals,  and  the  “tank-number”  is  (just)  read¬ 
able  in  the  super-resolution  image  while  not  in  the 
original. 

Results  from  our  experiments  show  that  the  image- 
based  method  we  propose  herein  is  not  only  com¬ 
putationally  cheaper,  but  it  also  gives  results  compa¬ 
rable  to  or  better  than  those  using  back-projection. 
In general, our method is often more than two or
three times faster. See [Chiang and Boult-1996a,
Chiang-1996] for details. Moreover, it is easily seen
from Figs. 1, 2, and 3 that the integrating resampler
outperforms traditional bi-linear resampling.

3  Imaging-Consistent  Warping 

Central to our image-based technique is the use of
what we call imaging-consistent warping. Due to
limited space, we only briefly review this idea;
more details (including the image formation process
and the sensor model) can be found in [Chiang and
Boult-1996b].

Consider the imaging model in Fig. 5. An algorithm
is called imaging-consistent if it is the exact solution
for some input function, which, according to the
sensor model, would have generated the measured input.
For image reconstruction, we achieve this by
computing a functional restoration (i.e., f₂), then blurring
it again by the pixel’s PSF. This actually defines a
whole class of image restoration/reconstruction
techniques, depending on the model for f₂. Probably the
simplest method to consider is based on a piecewise
quadratic model for the image. If we assume a Rect
PSF filter for the photosite, the imaging-consistent
algorithm is easy to derive (see [Chiang and
Boult-1996b]). To ensure that the function is continuous,
and that the method is local, we define the value
of the reconstruction at the pixel boundaries k_i and
k_{i+1} to be equal to E_i and E_{i+1}, where we compute
E_i with some technique, e.g., cubic convolution.
The values E_i at the pixel edges, combined with the
imaging-consistent constraint (the integral across the
pixel must equal the measured intensity), result in
exactly three constraints. From this, one can determine
the quadratic polynomial for f₂. This gives the
intra-pixel restoration. For super-resolution, we consider
only this intra-pixel restoration (abbreviated QRS in
the following discussion). Reconstruction can be
derived by simply blurring the resulting restoration by
a PSF of the same scale as the input.

Figure 5: The image formation process and the relationship
between restoration, reconstruction, and super-resolution.
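
The three constraints determine the quadratic in closed form. The Python sketch below is a one-dimensional illustration under simplifying assumptions, not the QRS implementation of [Chiang and Boult-1996b]: the pixel is normalized to the interval [0, 1], the Rect PSF has unit width so the measured intensity equals the integral of the quadratic over the pixel, and the edge values are taken as given rather than computed by cubic convolution. The function names and the 4x sub-pixel factor are hypothetical.

```python
def quadratic_restoration(e_left, e_right, intensity):
    """Coefficients (a, b, c) of q(x) = a x^2 + b x + c on the unit pixel [0, 1].

    Constraints: q(0) = e_left, q(1) = e_right, integral of q over [0, 1] = intensity.
    """
    a = 3.0 * (e_left + e_right) - 6.0 * intensity
    b = 6.0 * intensity - 4.0 * e_left - 2.0 * e_right
    c = e_left
    return a, b, c


def integrate(coeffs, x0, x1):
    """Exact integral of the quadratic between x0 and x1."""
    a, b, c = coeffs

    def antiderivative(x):
        return a * x ** 3 / 3.0 + b * x ** 2 / 2.0 + c * x

    return antiderivative(x1) - antiderivative(x0)


def subpixel_values(coeffs, factor=4):
    """Average of the quadratic over `factor` equal sub-intervals of the pixel."""
    w = 1.0 / factor
    return [integrate(coeffs, j * w, (j + 1) * w) / w for j in range(factor)]


if __name__ == "__main__":
    coeffs = quadratic_restoration(10.0, 20.0, 18.0)
    print(coeffs)                   # (-18.0, 28.0, 10.0)
    print(subpixel_values(coeffs))  # four sub-pixel samples averaging to 18
```

Integrating the restored quadratic over each output sub-interval, rather than point-sampling it, is in the spirit of the integrating resampler: the sub-pixel values average back to the measured intensity of the original pixel.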

To define the integrating resamplers, we generalize
the idea of the imaging-consistent algorithms
described above. Whereas imaging-consistent
algorithms simply assume the degradation models are
identical for both input and output, the integrating
resamplers go one step further, allowing (1) both input
and output to have their own degradation model, and
(2) the degradation model to vary its size for each
output pixel.

When  we  are  resampling  the  image  and  warping  its 
geometry  in  a  nonlinear  manner,  this  new  approach 
allows  us  to  efficiently  do  both  pre-filtering  and  post¬ 
filtering.  Because  we  have  already  determined  a 
functional  form  for  the  input,  no  spatially-varying 
filtering  is  needed,  as  would  be  the  case  if  direct  in¬ 
verse mapping were done. The integrating resampler
[Chiang and Boult-1996b] also handles antialiasing
of partial pixels in a straightforward manner.

4  Edge-Based  Super-Resolution 

For almost all applications involving an image sequence,
the problem of lighting variation arises, even
when the images are taken consecutively in a
well-controlled environment. If the images are not from a
short time span, variations are often significant. The
idea we propose herein is simple: we fuse
edge/blur information with one of the original intensity
images to reconstruct the super-resolution image.
This effectively mitigates the problem of lighting
variation, since edge positions are much less sensitive
to a change of lighting.

To fuse all the edges together, the
edges must first be detected and warped. Fusion also requires
an image reconstruction technique that directly incorporates
both the edges and the intensities. This will allow
the reference image to be reestimated and scaled
up based on the edge models and local blur estimation.
We have generalized the idea of the imaging-consistent
reconstruction algorithms to deal with discontinuities
in an image [Chiang and Boult-1997].

Given the image sequence, our edge-based super-resolution
algorithm is as follows:

1. Estimate the edges and blur models using the
procedure described in Section 4.1.

2.  Estimate  the  motions  involved  in  the  image  se¬ 
quence. 

3.  Choose  one  of  the  images  as  the  reference  image 
(the  one  with  “best”  lighting).  Scale  the  refer¬ 
ence  image  up  and  deblur  at  the  same  time  it  is 
being  scaled  up. 

4.  Warp  all  the  edges/blur  models  to  the  reference 
image  and  fuse  them. 

5.  Use  the  fused  edge/blur  models  and  the  refer¬ 
ence  image  to  compute  the  super-resolution  in¬ 
tensity  image. 

6.  Optional  deblurring  stage. 

4.1 Edge Localization & Local Blur Estimation

Typically,  edge  detection  involves  the  estimation  of 
first  and  second  derivatives  of  the  luminance  func¬ 
tion,  followed  by  selection  of  zero  crossing  and  ex¬ 
trema.  While  the  world  does  not  need  yet  another 
edge  detector,  we  define  a  new  one  because  it  allows 
us  to  work  in  a  consistent  framework,  incorporating 
the  edge  model  into  the  image  reconstruction  algo¬ 
rithm.  The  edge  localization/detection  is  obtained  by 
differentiating  the  functional  form  of  the  image  re¬ 
construction  model  and  considering  only  significant 
extrema  of  the  first  derivative.^  The  edge  model  is 

^Yes, there is a threshold hiding there. Future work will address
how to better determine significant vs. insignificant edges and, more
importantly, whether this should be done before or after the fusion of the
“edge-models”.
Figure 4: Results from a toy-tank sequence: (a) one
of the original images, pixel replicated by a factor of
4; (b) super-resolution using QRS followed by deblurring.

tied  into  the  image  reconstruction  in  that  each  pixel 
is  now  modeled  as  potentially  having  a  discontinu¬ 
ity,  while  still  satisfying  the  imaging-consistent  con¬ 
straint.  The  model  we  use  is  piecewise  quadratic, 
and  the  integral  across  the  pixel,  including  any  step 
discontinuity,  must  still  equal  the  measured  data.  If  a 
discontinuity  is  included  within  a  pixel,  the  approx¬ 
imations  used  for  the  pixel  boundaries  are  recom¬ 
puted  using  data  from  only  one  side  of  the  discon¬ 
tinuity. 

In [Chiang and Boult-1996a], we showed that deblurring
after image fusion is most effective for
super-resolution imaging. However, that work presumes
that the blur is not dominated by depth-of-field
effects. This allows us to replace a spatially-varying
point spread function with a cascade of two
simpler components: a spatially-invariant blur and a
geometric warp. Unfortunately, this assumption is
rarely true in practice.

In [Chiang and Boult-1997] we describe our local
blur estimation in detail. In summary, we use a simple
blurred step edge, assuming that the step edge
is from $v$ to $v + \delta u(x)$, where $v$ is the unknown intensity
value and $\delta$ is the unknown amplitude of the
edge. The blur of this edge is modeled by a “truncated”
Gaussian blur kernel

where $\sigma$ is the unknown standard deviation. The
complete edge model is thus given by Eq. (1) and
Eq. (2) where a is a predefined nonnegative constant,
and $x_0 \in [0, 1]$ is the location of the step edge. Given
the functional form of our reconstruction, we can
solve directly for the three parameters of the blurred
edge model.

For the examples, we again presume that “motion”
is computed, which for general lighting changes is
much more difficult. Here we use
a normalized SSD computation. Fig. 6 shows two
of the eight original images with two different illumination
conditions blown up by a factor of 4 using,
respectively, pixel replication and bi-linear resampling.
Figs. 6a and b show originals magnified
with pixel replication, while Figs. 6c and d show the
edge-based super-resolution results before and after
deblurring.

5  Image-Based  vs.  Edge-Based 

Because of space limits, we only briefly compare
the two algorithms proposed herein. Both algorithms
take time roughly proportional to the number of images
in the image sequence, with the image-based fusion
being the faster of the two, producing a 500x500
super-resolution image in a few seconds on an UltraSPARC.

If the variation of lighting is small, such as in a
controlled indoor environment, the image-based approach
is more appropriate because it uses the intensity
information provided by the whole image sequence
to construct the super-resolution image and
thus is better at removing the noise and undesirable
artifacts. On the other hand, the edge-based algorithm
is more appropriate if the variation of illumination
is large.

If the variation of lighting is intermediate, a possible
solution is a hybrid of the two algorithms
we propose herein. The idea is that, instead of choosing
a single reference image for the edge-based super-resolution
algorithm, we use the average or median of
a sub-sequence of the image sequence as the reference
image, presuming that the variation of lighting
is not so significant within the sub-sequence.
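
A minimal sketch of the median variant of this hybrid reference image follows. It assumes the frames of the sub-sequence have already been warped onto a common pixel grid and are given as equal-sized nested lists of intensities; the function name is hypothetical and the choice of sub-sequence is left to the caller.

```python
from statistics import median


def reference_from_subsequence(frames):
    """Per-pixel median of registered frames (lists of rows of intensities).

    Assumption: all frames share one size and are already aligned, so the
    same (row, col) index refers to the same scene point in every frame.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[r][c] for frame in frames) for c in range(cols)]
        for r in range(rows)
    ]
```

The median is less affected than the mean by a single frame with unusual lighting, at the cost of slightly less noise averaging.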

6  Future  Work 

Further work is needed before the super-resolution
algorithm is robust enough for general use in VSAM
applications; in particular, we need to incorporate
more robust sub-pixel matching algorithms and include
better deblurring algorithms. Quantitative
analysis of both approaches is now under way using
recognition rates as the benchmark.

7  Conclusion 

This  paper  introduces  two  algorithms  for  enhanc¬ 
ing  image  resolution  from  an  image  sequence.  The 
image-based  approach  presumes  that  the  images 
were  taken  under  the  same  illumination  conditions 
and  uses  the  intensity  information  provided  by  the 
image  sequence  to  construct  the  super-resolution  im¬ 
age.  The  edge-based  approach,  based  on  edge  mod¬ 
els  and  a  local  blur  estimate,  circumvents  the  diffi¬ 
culties  caused  by  lighting  variations.  We  show  that 
image  warping  techniques  may  have  a  strong  impact 
on  the  quality  of  image  resolution  enhancement. 
Abstract

Using  multiple  sensors  in  a  vision  system  can  sig¬ 
nificantly  reduce  both  human  and  machine  errors  in 
detection  and  recognition  of  objects.  A  particular 
case  of  interest  is  where  images  from  possibly  dif¬ 
ferent  types  of  sensors  are  to  be  combined.  An  im¬ 
age  fusion  scheme  is  proposed  which  combines  as¬ 
pects  of  feature-level  fusion  with  the  pixel-level  fu¬ 
sion.  Images  are  fused  by  combining  their  wavelet 
transforms. The identification of important features
in each image, such as edges and regions of interest,
is used to guide the fusion process. Experiments
show  that  this  algorithm  works  well  in  many  situa¬ 
tions. 

1  Introduction 

In  recent  years,  multisensor  image  fusion  has  re¬ 
ceived  significant  attention  for  both  military  and  in¬ 
dustrial  applications.  Concealed  weapon  detection 
(CWD)  is  one  interesting  application.  CWD  ap¬ 
pears  to  be  a  critical  technology  for  dealing  with  ter¬ 
rorism.  Detecting  concealed  weapons  is  especially 
difficult  when  one  wants  to  monitor  an  area  where 
portal systems are not practical. Portable systems,
which could be placed in a police car, would be desirable.
Due to the difficult nature of the problem,
an extensive study indicated that no single sensor
technology can provide acceptable performance over
all of the scenarios of interest [Currie et al.-1996].
This justifies a study of fusion techniques to achieve
improved CWD procedures. A number of compatible
sensor technologies have already been identified
which could provide improved performance if
a fusion scheme were available [Currie et al.-1996].

*This work is supported in part by the ONR/DOD MURI program,
contract N00014-95-1-0601. The source images in Fig. 3 were obtained
from Thermotex Corporation.


Most  of  the  technologies  produce  images,  so  image 
fusion  is  of  interest. 

We  use  the  term  image  fusion  to  denote  a  process 
by  which  multiple  images  or  information  from  mul¬ 
tiple  images  are  combined.  These  images  may  be 
obtained from different types of sensors. The majority
of research on image fusion can be classified
into two categories: pixel-level image fusion and
feature-level image fusion [Luo and Kay-95]. Pixel-level
fusion generates a fused image in which each
pixel  is  determined  from  a  set  of  pixels  in  source  im¬ 
ages.  The  fused  image  is  expected  to  be  such  that 
the  performance  of  a  particular  task  of  interest,  such 
as  object  detection,  can  be  improved.  Feature-level 
fusion  first  employs  feature  extraction  separately  on 
each  image  and  then  performs  the  fusion  based  on 
the  extracted  features.  It  enables  the  detection  of 
useful  features  with  higher  confidence,  and  a  fused 
image  is  not  necessarily  generated  in  this  case.  Cur¬ 
rently,  it  appears  that  more  people  are  focusing  on 
pixel-level  fusion. 

Image fusion based on pyramid decomposition is
one of the popular fusion methods. Pyramid decomposition
methods construct a fused pyramid representation
from the pyramid representations of the
original images. The fused image is then obtained
by taking an inverse pyramid transform. The approach
was apparently first introduced in [Burt and
Adelson-1983, Burt-1984] for image coding and
binocular fusion in human vision. Several other
pyramid-based image fusion schemes were proposed
in [Toet-1990, Akerman III-1992, Burt and
Kolczynski-1993]. More recently, approaches based
on the wavelet transform have begun to receive considerable
attention [Ranchin et al.-1993, Chipman et
al.-1995, Li et al.-1995]. In [Ranchin et al.-1993],
the authors studied fusion based on multiresolution
image decomposition and reconstruction using the
wavelet transform. They presented a technique for
enhancing the spatial resolution of a SPOT image
using another image from a different band from the
same satellite. In [Chipman et al.-1995], the focus
is on fusing multispectral aerial photos using a set of
basic operations on particular sets of wavelet coefficients
which correspond to certain frequency bands.
In [Li et al.-1995], a wavelet transform approach is
considered which uses an area-based maximum selection
rule and a consistency verification step. The
wavelet-transform-based approaches exhibit advantages
in terms of compactness, directional selectivity
and orthogonality [Li et al.-1995]. However, previous
research had considered relatively simple methods
for combining the wavelet coefficients which
did not make full use of the spatial information contained
in the source images.

In  this  paper,  we  illustrate  a  wavelet  transform  based 
image  fusion  approach  where  we  combine  aspects  of 
both  pixel-level  and  feature-level  fusion.  The  feature 
used  is  an  object  or  region  of  interest  which  we  refer 
to  as  a  region  here.  Since  objects  and  parts  of  objects 
carry  the  information  of  interest,  it  is  reasonable  to 
focus on them in the fusion algorithm. While previous
researchers consider each pixel in the image separately,
or just consider the pixel and its close neighbors,
such as a 3 x 3 or 5 x 5 window, they neglect
the fact that each pixel is just one part of an object
or region. The objects are what we are really interested
in. By considering each pixel separately, noise
and  blurring  effects  are  often  introduced  during  the 
fusion  process.  Our  region-based  method  appears  to 
be  a  good  way  to  solve  these  problems. 

In this paper, we consider fusion of two images only.
Extensions to more than two images can be developed
in a similar way. The proposed fusion scheme
is described in Section 2. Some experimental results
are presented in Section 3. Section 4 presents our
conclusions.

2  The  Region-Based  Image  Fusion  Scheme 

We  take  it  as  a  prerequisite  that  the  source  images 
must  be  registered,  so  that  the  corresponding  pixels 
are  aligned.  The  discrete  wavelet  transform  of  each 
of the two registered images is computed. Then, using
a scheme discussed later, the decision map is
generated. Each pixel of the decision map denotes
which image best describes this pixel. Based on the
decision  map,  we  fuse  the  two  images  in  the  wavelet 
transform  domain.  The  final  fused  image  is  obtained 
by  taking  the  inverse  wavelet  transform. 


For each source image, the edge image, region image
and region activity table are generated as shown in
Figure 1. Next, the region activity tables of each image
are used to create the fusion decision map. This
is  also  illustrated  in  Figure  1.  Each  pixel  in  the  fu¬ 
sion  decision  map  tells  which  image  should  be  used 
to  provide  the  wavelet  coefficients  related  to  the  cor¬ 
responding  pixel  in  the  region  image. 


Source Image 1    Source Image 2


Figure  1:  Data  flow  for  creating  the  decision  map 


2.1  Wavelet  Transform 

The  wavelet  transform  [Vetterli  and  Herley-1992, 
Mallat-1989]  of  an  image  provides  a  multiscale 
pyramid  decomposition  for  the  image.  This  decom¬ 
position  will  typically  have  several  stages.  There 
are  four  frequency  bands  after  each  decomposition. 
These  are  the  low-low,  low-high,  high-low  and  high- 
high  bands.  The  next  stage  of  the  decomposition 
process operates only on the low-low part of the previous
result. This produces a pyramid hierarchy as
shown in Figure 2, in which the top of the pyramid,
denoted by $LL^2$, is a low-low frequency band. We
can think of this low-low band as the lowpass filtered
and subsampled source image. All the other
bands  which  we  call  high  frequency  bands  contain 
transform  coefficients  that  reflect  the  differences  be¬ 
tween  neighboring  pixels  and  thus  can  be  positive  or 
negative.  If  we  are  dealing  with  grayscale  images, 
then  the  absolute  values  of  the  high  frequency  coef¬ 
ficients  represent  the  intensity  of  brightness  fluctua¬ 
tion  of  the  scene  at  a  given  scale.  The  larger  values 
imply  more  distinct  brightness  changes  which  typ¬ 
ically  correspond  to  the  salient  features  of  objects. 
Thus, a simple fusion rule is to select the larger absolute
value of the two corresponding wavelet coefficients
from each of the two source images. There
are two disadvantages of this method. It may have
high sensitivity to noise and it may produce a blurring
effect. To eliminate these undesirable features,
we first divide each image into different objects and
regions. Then, instead of performing the fusion pixel
by pixel, we make the decision object by object and
region by region. Thus in the fused image, each object
will be described by the data from the clearer of
the two images.
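
The decomposition and the simple pixel-by-pixel selection rule described above can be illustrated as follows. This is only a sketch under stated assumptions: it uses the Haar wavelet with an averaging normalization (the paper does not say which wavelet or normalization is used), image sides divisible by two at every stage, and one of several common conventions for naming the LH and HL bands. All function names are hypothetical.

```python
def haar_level(img):
    """One 2-D Haar stage on an even-sized image; returns (LL, LH, HL, HH)."""
    rows, cols = len(img) // 2, len(img[0]) // 2
    ll = [[0.0] * cols for _ in range(rows)]
    lh = [[0.0] * cols for _ in range(rows)]
    hl = [[0.0] * cols for _ in range(rows)]
    hh = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            p, q = img[2 * r][2 * c], img[2 * r][2 * c + 1]
            s, t = img[2 * r + 1][2 * c], img[2 * r + 1][2 * c + 1]
            ll[r][c] = (p + q + s + t) / 4.0  # local average
            lh[r][c] = (p + q - s - t) / 4.0  # difference between the two rows
            hl[r][c] = (p - q + s - t) / 4.0  # difference between the two columns
            hh[r][c] = (p - q - s + t) / 4.0  # diagonal difference
    return ll, lh, hl, hh


def haar_pyramid(img, levels):
    """Repeat the stage on the low-low band only, as in the pyramid hierarchy."""
    bands, current = [], img
    for _ in range(levels):
        current, lh, hl, hh = haar_level(current)
        bands.append((lh, hl, hh))  # finest stage first
    return current, bands


def select_max_abs(coeffs_a, coeffs_b):
    """Pixel-level rule: keep the coefficient of larger magnitude (ties go to A)."""
    return [
        [a if abs(a) >= abs(b) else b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(coeffs_a, coeffs_b)
    ]
```

Because `select_max_abs` decides each coefficient in isolation, a single noisy coefficient can win the comparison, which is the sensitivity that the region-based decision is meant to reduce.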


Figure 2: Pyramid hierarchy of wavelet transform. Legend: 1, 2 = resolution levels; H = high frequency band; L = low frequency band.


2.2  Region  Labeling 

We apply Canny edge detection to the low-low band
of the coefficient pyramid obtained from the wavelet
transform. The low-low band has a lower resolution
than the source image, but it still contains the spatial
region information. The output of the Canny detector
is an edge image, on which region segmentation
is performed using a labeling algorithm described in
[Zhang and Blum-1997]. Finally, we get a labeled
image in which each different value represents a different
region, and zero corresponds to edges.

The focus of this paper is on image fusion. While
we have employed some specific edge detection and
region labeling algorithms, other edge detection and
region labeling algorithms, which may come from
current or future studies, can be easily substituted
for ours. The edge detection and labeling algorithms
we choose are not necessarily the best. They just
illustrate our approach.

2.3  Fusion 

Information on salient features of an object is partially
captured by the magnitude of the high frequency
wavelet coefficients that correspond to that object.
Consider two regions with similar size and
signal-to-noise ratio (SNR) in two registered images
which each represent the same object in a real scene.
The one which has the larger magnitude high frequency
components will generally contain more detail.
Under this assumption, we first calculate the activity
level of each region as the average of the absolute
value of the high frequency band wavelet coefficients.
Next we generate the decision map according
to the activity level, size and position of each object
or region. If the SNRs of two images are different,
this can be accounted for by introducing a weight
factor in the activity level calculation.

Recall that the region image corresponds to the low-low
band of the wavelet coefficient pyramid. The
activity level of region $k$ in source image $n$, $Al_n(k)$, is
given by

$$Al_n(k) = \frac{1}{N_k} \sum_{1 \le j \le N_k} P_j \qquad (1)$$

where $N_k$ is the total number of pixels in region $k$, and $P_j$
is the activity intensity of pixel $j$ in region $k$, which
is given by

$$P_j = \frac{W}{M} \sum_{1 \le m \le M} \frac{1}{3 \cdot 2^{2(M-m)}} \sum_{1 \le i \le 3 \cdot 2^{2(M-m)}} |C_i| \qquad (2)$$

where $W$ is the weight which is determined by the
SNR of the image and other factors, $M$ is the number
of wavelet decomposition levels, and $C_i$ is one of the
wavelet coefficients in high frequency bands corresponding
to pixel $j$. The second sum in (2) is over all
the wavelet coefficients that correspond to pixel $j$ in
the high frequency bands of the $m$th decomposition
stage.
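
Equations (1) and (2) can be evaluated directly once the coefficient pyramid and the labeled region image are available. The sketch below assumes a particular data layout that the paper does not specify: `bands` is a list of decomposition stages, finest first, each holding three high frequency bands as nested lists, and `labels` is the region image at the coarsest resolution with zero marking edge pixels. The function names and the default weight of 1.0 are likewise assumptions.

```python
def pixel_activity(bands, r, c, weight=1.0):
    """Activity intensity P_j of region-image pixel (r, c), following Eq. (2)."""
    num_levels = len(bands)  # M
    total = 0.0
    for m, stage in enumerate(bands, start=1):
        side = 2 ** (num_levels - m)  # block side at stage m
        acc = 0.0
        for band in stage:  # the three high frequency bands
            for dr in range(side):
                for dc in range(side):
                    acc += abs(band[r * side + dr][c * side + dc])
        total += acc / (3 * side * side)  # average over 3 * 2^(2(M-m)) values
    return weight * total / num_levels


def region_activity(bands, labels, weight=1.0):
    """Activity level of each region k, following Eq. (1); label 0 is skipped."""
    sums, counts = {}, {}
    for r, row in enumerate(labels):
        for c, k in enumerate(row):
            if k == 0:  # edge pixel, not part of any region
                continue
            sums[k] = sums.get(k, 0.0) + pixel_activity(bands, r, c, weight)
            counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}
```

The result is one activity value per region label, which corresponds to the region activity table used to build the decision map.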


Next we describe how to produce the binary decision
map. Suppose we have two registered images A and
B to be fused. If a given pixel in the decision map is
a “1” then all the wavelet coefficients corresponding
to this pixel are taken from image A. If the pixel is
“0”, all the wavelet coefficients corresponding to this
pixel are taken from image B. For a specific pixel of
the decision map, P(i,j), this pixel may be:


1.  in  region  m  of  image  A,  and  region  n  of  image 
B 

2. an edge point in one image, and in a certain region
in the other image

3.  an  edge  point  in  both  images 

We  assign  the  value  of  each  pixel  in  decision  map 
according  to  the  following  criteria: 

•  Small  regions  preferred  over  large  regions  when 
comparing  activity  levels 

•  Edge points preferred over non-edge points
when comparing activity levels

•  High  activity-level  preferred  over  low  activity- 
level 

•  Make  decision  on  non-edge  points  first  and  con¬ 
sider  their  neighbors  when  making  the  decision 
on  edge  points 

•  Avoid  isolated  points  in  decision  map 




A binary decision map is now created to fuse the
two wavelet coefficient arrays into one. Each pixel
in the decision map corresponds to a set of wavelet
coefficients in each frequency band of all decomposition
levels. The decision map is smaller than the
original image by a factor of $2^M$ in each dimension
(that is, it has $1/2^{2M}$ as many pixels), where M is the
number of decomposition stages. The value of M should not
be too small, or we cannot take advantage of
the decrease in image size due to the wavelet transform;
in that case, the computational complexity will
increase sharply. A large number of decomposition stages
is also not desired, since the resolution for region detection will
be low. In practice, the choice of M is made according
to the size of the source image and its resolution. For
our second example in Fig. 4, which uses 512 x 512
source images, an appropriate number of decomposition
stages is two. In this case, the size of the edge
image, region image and decision map will be 128 x
128.
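
The correspondence between a decision-map pixel and its set of wavelet coefficients can be sketched as below. The sketch assumes the dyadic layout implied by the text: at decomposition stage m, each decision-map pixel covers a square block of side 2^(M-m) in every band. The function names are hypothetical.

```python
def decision_map_shape(rows, cols, levels):
    """Size of the decision map for an image of rows x cols and M = levels."""
    return rows >> levels, cols >> levels


def fuse_band(band_a, band_b, decision, side):
    """Fuse one band of images A and B using the binary decision map.

    `side` is the block side 2**(M - m) for this band's stage m; a decision
    value of 1 takes the coefficient from A, and 0 takes it from B.
    """
    return [
        [
            band_a[r][c] if decision[r // side][c // side] == 1 else band_b[r][c]
            for c in range(len(band_a[0]))
        ]
        for r in range(len(band_a))
    ]
```

For example, `decision_map_shape(512, 512, 2)` gives `(128, 128)`, matching the sizes quoted for the second example.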

3  Experimental  Result 

We tested our algorithm on several pairs of images.
Some of the results are described here. Figure 3
shows a pair of visual and 94 GHz millimeter-wave
(MMW) images. The visual image provides the
outline and the appearance of the people while the
MMW image shows the existence of a gun. From the
fused image, we can clearly and promptly see that
the person on the right has a concealed gun beneath
his clothes. This fused image may be very helpful
to a police officer, for example, who must respond
quickly. Figure 4 shows a pair of multifocused test
images. In one image, the focus is on the Pepsi can.
In the other image, the focus is on the testing card.
In the fused image, the Pepsi can, the table, and the
testing card are all in focus. These examples illustrate
that our algorithm works whether the images
come from the same or from different types of
sensors.

4  Conclusion 

We have presented a new approach to multi-sensor
image fusion which combines the frequency information
from the wavelet transform with the spatial information
from the original image. We use a particular
image feature, regions which we believe represent
objects, to guide the fusion process. Since objects
and parts of objects carry the information of
interest, it is reasonable to focus on them in the fusion
algorithm. Concealed weapon detection is one interesting
application of our fusion algorithm. However,
our algorithm can also be used in other applications.
(a). Image A (Visual)  (b). Image B (94 GHz MMW)  (c). Fused image

(d). Edge image A  (e). Edge image B  (f). Region image A  (g). Region image B  (h). Decision map

Figure 3: Fusion result on visual and radiometric images

(d). Edge image A  (e). Edge image B  (f). Region image A  (g). Region image B  (h). Decision map

Figure 4: Fusion result on multi-focus images
Abstract

Early  vision  relies  heavily  on  uniform,  rectangu¬ 
lar  operators  for  tasks  such  as  segmentation  or 
computing  correspondence.  While  rectangular 
windows  have  obvious  efficiencies,  they  poorly 
model  the  boundaries  of  real-world  objects.  We 
have  developed  an  efficient  method  for  adap¬ 
tively  choosing  a  window  without  a  rectangular 
bias,  in  a  manner  which  varies  at  each  pixel.  We 
model  an  image  as  a  piecewise  constant  func¬ 
tion  corrupted  by  noise,  and  explicitly  consider 
all  possible  connected  components.  Almost  all 
components  can  be  pruned,  however,  by  a  sim¬ 
ple  maximum  likelihood  argument.  The  remain¬ 
ing  components  can  be  compared  by  a  variety  of 
methods,  including  (for  example)  global  contex¬ 
tual  constraints.  Our  approach  can  be  applied 
to  many  problems,  including  image  restoration, 
motion  and  stereo.  It  can  help  solve  a  number 
of  well-known  problems,  including  the  aperture 
problem.  Our  methods  run  in  a  few  seconds 
on  traditional  benchmark  images  with  standard 
parameter  settings,  and  give  quite  promising  re¬ 
sults. 

1  Introduction 

Many  problems  in  early  vision  are  ill-posed,  i.e. 
they  cannot  be  uniquely  solved  without  addi¬ 
tional  constraints.  Scene  geometry  can  pro¬ 
vide  constraints  that  make  these  problems  well- 
posed.  For  efficiency  reasons,  most  algorithms 
make  use  of  rectangular  windows,  which  poorly 
model  the  boundaries  of  real-world  objects.  In 


this  paper  we  present  an  approach  to  early  vi¬ 
sion  problems  that  efficiently  chooses  a  window 
adaptively  at  each  pixel  with  no  rectangular 
bias.  The  method  can  be  used  for  image  restora¬ 
tion,  as  well  as  for  motion  or  stereo. 

We  will  begin  by  looking  at  image  restoration, 
and  then  generalize  to  motion  and  stereo.  We 
model  an  image  as  a  set  of  constant  intensity 
regions  (or  connected  components)  that  are  cor¬ 
rupted  by  noise.  Our  approach  has  three  parts. 
We  first  consider  all  possible  intensities  for  all 
possible  components.  We  will  call  the  pair  con¬ 
sisting  of  a  component  and  an  intensity  a  com- 
ponent  hypothesis.  A  simple  maximum  likeli¬ 
hood  argument  prunes  almost  all  the  compo¬ 
nent  hypotheses  from  consideration.  The  sec¬ 
ond  part  of  our  method  is  to  rank  the  different 
component  hypotheses.  An  easy  way  to  do  this 
is  by  preferring  the  largest  components.  The  fi¬ 
nal  part  of  our  method  is  to  assign  each  pixel  in 
the  image  an  intensity.  This  can  be  done  locally, 
or  with  global  consistency  constraints. 

2  Maximum  likelihood  hypothesis 
testing 

Consider the problem of piecewise constant image
restoration. Assume that $I_t(P)$ denotes the
true intensity of a pixel $P$ and that $I(P)$ denotes
an observed intensity of the same pixel
$P$ after the picture is corrupted by noise, i.e.
$I(P) = I_t(P) + \nu(P)$. If $S$ is a connected set
of pixels and $i$ is an intensity, we will initially
consider all possible component hypotheses. A
component hypothesis $S^i$ states that the pixels
in the component $S$ have the intensity $i$ in $I_t$.

Let us pick a particular intensity $i$. The key step
in our method is that a pixel $P$ is included in $S^i$
if and only if $P$ is more likely to have intensity $i$
than not. If every pixel in $S^i$ has this property,
then we will call the component hypothesis $S^i$
plausible. By only considering plausible component
hypotheses, we can enormously reduce the
amount of work.
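
As an illustration of how plausibility prunes the search, the sketch below extracts the maximal plausible components for one intensity. It assumes a noise model that is symmetric and non-increasing about $i$, for which the likelihood test derived later in this section reduces to a fixed threshold on $|I(P) - i|$. The use of 4-connectivity, the restriction to maximal components, and the function name are assumptions made for the example only.

```python
from collections import deque


def plausible_components(image, intensity, eps):
    """Maximal 4-connected sets of pixels P with |I(P) - intensity| <= eps."""
    rows, cols = len(image), len(image[0])
    passes = [
        [abs(image[r][c] - intensity) <= eps for c in range(cols)]
        for r in range(rows)
    ]
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not passes[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over pixels that pass the test.
            queue, component = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and passes[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components
```

Every connected subset of a returned component is also plausible, so the maximal components are a compact summary of all plausible hypotheses for that intensity.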

To enforce plausibility we must decide between
the following two hypotheses:

$$H_0 : I_t(P) = i$$
$$H_1 : I_t(P) \neq i$$

Assume that the noise model is given by the
function

$$f(x \mid i) = \Pr(I(P) = x \mid I_t(P) = i). \qquad (1)$$

This function may correspond to a uniform,
Gaussian, or any other distribution function.
We will write $f(x \mid i)$ as $f(I(P) \mid i)$ where $I(P)$
will serve as an intensity value observed in a
fixed experiment (not as a random variable).

We choose between $H_0$ and $H_1$ by comparing
the likelihoods $\Pr(I(P) \mid H_0)$ and $\Pr(I(P) \mid H_1)$;
in other words, we assume there is no prior bias
in favor of $H_0$ or $H_1$. The hypothesis with the
largest likelihood wins. Obviously, we have

$$\Pr(I(P) \mid H_0) = f(I(P) \mid i).$$
To  compute  Pr{I{P)\Hi)  we  proceed  as  follows: 


Pt{I{P)\Hi) 


Pt{I{P),Hi) 

Pr{Hi) 

^  Pr(Pi) 

V  f(nH)li)Pr(It(P)  =  i) 
h  Pr(ffi) 


We  assume  that  prior  probabilities  of  all  inten¬ 
sities  are  equal.  This  implies  that  Pr(/f(P)  =  i) 
does  not  depend  on  i.  Thus, 


where  |P|  denotes  the  number  of  all  possible 
intensities.  Therefore 

To  choose  Hq  over  Hi  we  require 

Finally,  this  comparison  test  can  be  rewritten 
as 

(2) 

so  that  we  accept  hypothesis  Hq  in  cases  where 
the  likelihood  of  intensity  i  is  larger  than  or 
equal  to  the  average  likelihood  of  all  possible  in¬ 
tensities. 

At this point it becomes obvious that we can use any noise model $f(\cdot \mid i)$ in formula (2). If $f(\cdot \mid i)$ is uniform, Gaussian, or any other non-increasing symmetric distribution function centered at $i$ then test (2) can be reformulated as

$$|I(P) - i| < \epsilon \qquad (3)$$

where $\epsilon$ is a parameter that depends on the particular distribution function $f(\cdot \mid i)$ in hand. Note that the threshold $\epsilon$ is the same for all pixels $P$ in the image. This furnishes a computationally trivial way to compute the plausible component hypotheses.
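To illustrate how the average-likelihood test (2) collapses to a threshold test of the form (3), here is a minimal Python sketch. It assumes a Gaussian noise model and 256 equally likely intensities; the function names, the value `sigma = 6.0`, and the observed intensity 100 are choices made for this example only and are not taken from the method described above.

```python
import math

def gaussian_likelihood(x, i, sigma):
    """f(x | i): assumed Gaussian noise model centered at the true intensity i."""
    return math.exp(-((x - i) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def is_plausible(observed, i, intensities, sigma):
    """Test (2): accept H0 when f(observed | i) is at least the average likelihood."""
    average = sum(gaussian_likelihood(observed, j, sigma) for j in intensities) / len(intensities)
    return gaussian_likelihood(observed, i, sigma) >= average

intensities = range(256)   # assumed 8-bit intensity range
observed = 100             # example observed intensity I(P)
plausible = [i for i in intensities if is_plausible(observed, i, intensities, sigma=6.0)]

# The accepted intensities form one symmetric interval around the observation,
# which is exactly what a single threshold on |I(P) - i| expresses.
print(plausible[0], plausible[-1])
```

Under these assumptions the accepted set is a contiguous interval centered on the observed value, so the per-pixel test reduces to one comparison against a fixed threshold.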

2.1  Motion  or  stereo 

Consider now the problem of motion or stereo. We will denote a disparity by $d$. We will write the statement that pixel $P$ has disparity $d$ by $P^d$. $I(P)$ and $I'(P)$ will represent intensities of pixel $P$ in the first and in the second images, respectively. Consider some fixed disparity $d$ for pixel $P$. Again, we will enforce plausibility on every component hypothesis. We need to choose between the two hypotheses:

$H_0 : P^d$

$H_1 : \neg P^d$

Assume that the function $f(i \mid i')$ specifies the noise model, that is the distribution of intensity of a pixel in the first image given intensity $i'$ of the corresponding pixel in the second image. For convenience, we define for any event $E$ $\Pr'(E) = \Pr(E \mid I'(P_1), \ldots, I'(P_n)) = \Pr(E \mid I')$, which is the probability of $E$ conditioned on all the observed intensities from the image $I'$. Similarly we define $\Pr'(E \mid F) = \Pr(E \mid F, I')$.

Then

$$\Pr'(I(P) \mid P^d) = f(I(P) \mid I'(P + d))$$

where $P + d$ is a pixel obtained by shifting $P$ by disparity $d$.

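We choose between $H_0$ and $H_1$ by comparing the likelihoods $\Pr'(I(P) \mid H_0)$ and $\Pr'(I(P) \mid H_1)$. Obviously, we have

$$\Pr'(I(P) \mid H_0) = f(I(P) \mid I'(P + d)).$$

To estimate $\Pr'(I(P) \mid H_1)$ we proceed as follows:

$$\Pr'(I(P) \mid H_1) = \frac{\Pr'(I(P), H_1)}{\Pr'(H_1)} = \sum_{d' \neq d} \frac{\Pr'(I(P), P^{d'})}{\Pr'(H_1)} = \sum_{d' \neq d} \frac{f(I(P) \mid I'(P + d'))\,\Pr'(P^{d'})}{\Pr'(H_1)}$$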


We assume that prior probabilities of all disparities are equal. This implies that $\Pr'(P^d)$ does not depend on $d$. Consequently,

$$\Pr'(H_1) = (|D| - 1) \cdot \Pr'(P^d), \quad \forall d$$

where $|D|$ denotes the number of all possible disparities. Therefore

$$\Pr'(I(P) \mid H_1) = \frac{1}{|D| - 1} \sum_{d' \neq d} f(I(P) \mid I'(P + d')).$$

To prefer $H_0$ over $H_1$ we should have

$$f(I(P) \mid I'(P + d)) \geq \frac{1}{|D| - 1} \sum_{d' \neq d} f(I(P) \mid I'(P + d')).$$

Finally, this comparison test can be equivalently rewritten as

$$f(I(P) \mid I'(P + d)) \geq \frac{1}{|D|} \sum_{d'} f(I(P) \mid I'(P + d')) \qquad (4)$$

so that we accept hypothesis $H_0$ in cases where the likelihood of disparity $d$ is larger than or equal to the average likelihood of all possible disparities.

We can use any noise model $f(\cdot \mid \cdot)$ in formula (4). If $f$ satisfies* $f(x \mid i') = f(x - i')$ and if $\Delta P^d$ denotes $I(P) - I'(P + d)$ then test (4) is equivalent to

$$f(\Delta P^d) \geq \frac{1}{|D|} \sum_{d'} f(\Delta P^{d'}). \qquad (5)$$

This test is equivalent to

$$|\Delta P^d| < \epsilon$$

where $\epsilon$ depends on the noise model $f$. Again, this provides a computationally trivial way to compute the plausible component hypotheses.
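The threshold form of the disparity test can be sketched in a few lines of Python. The sketch below assumes images stored as nested lists, purely horizontal integer disparities, and the convention that $P + d$ means shifting the column index by $d$; pixels whose shifted position falls outside the image are simply marked as not plausible. The function name `plausible_mask` and the toy images are illustrative assumptions.

```python
def plausible_mask(left, right, d, eps):
    """Return a boolean grid that is True where |I(P) - I'(P + d)| < eps."""
    height, width = len(left), len(left[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            xs = x + d  # assumed convention: the disparity shifts the column index
            if 0 <= xs < width:
                mask[y][x] = abs(left[y][x] - right[y][xs]) < eps
    return mask

# Toy pair: the right image is the left image moved one column to the left.
left = [[10, 10, 50, 50, 10],
        [10, 10, 50, 50, 10]]
right = [[10, 50, 50, 10, 10],
         [10, 50, 50, 10, 10]]

# One mask per candidate disparity; each mask is computed in a single pass.
masks = {d: plausible_mask(left, right, d, eps=2) for d in (-1, 0, 1)}
```

Each mask costs one pass over the image, so the total work grows linearly with the number of pixels times the number of candidate disparities.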

2.2 Comparing component hypotheses and producing pixel estimates

The results above show that instead of considering all possible component hypotheses we only need consider a small set. In particular, for a particular hypothesis (intensity or disparity), we only consider plausible components, i.e. components of pixels whose likelihood under that hypothesis is above the pixel's average likelihood for all hypotheses. There are typically a small number of plausible component hypotheses.

The next step is to rank the different component hypotheses according to some preference criterion. The criterion should combine two sources of information: geometric information, encoded as priors, as to which segmentations are most likely; and statistical information regarding the residuals (i.e. $\Delta P^d$) within the component. In most of our experiments so far we have used the simple criterion that the largest component is the best explanation. However, more complex priors may also be used, and contextual information can be incorporated. For example, with stereo it is often desirable to find a ground plane somewhere in the lower part of the image; this constraint can be incorporated into the preference criterion. Statistical information about the residuals can also be used; at the correct match, the residuals should come from measurement noise, and should (for example) be unbiased.

*This holds if $f(\cdot \mid i')$ is uniform, Gaussian, or any other (bell-shaped) symmetric distribution function centered at $i'$.

Once the component hypotheses have been ranked, the final step is to assign a value (intensity or disparity) to each pixel. The simplest solution is to do this purely locally. For each pixel $P$, we consider those component hypotheses containing $P$. The best component hypothesis, under the preference criterion, will be used to give $P$ its value. To date, we have mostly used this local method, but more complex methods may eventually be desirable. For example, it may be important to impose a kind of global consistency, so that every pixel in the best component hypotheses is ensured the same value.
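The local assignment rule with the "largest component wins" criterion can be sketched as follows. This is only an illustration under stated assumptions: plausibility masks are given as nested lists of booleans keyed by hypothesis value, components are 4-connected (the connectivity is not specified above), and the helper names `component_sizes` and `assign_values` are hypothetical.

```python
from collections import deque

def component_sizes(mask):
    """Per-pixel size of the 4-connected component of True cells (0 where False)."""
    height, width = len(mask), len(mask[0])
    size = [[0] * width for _ in range(height)]
    seen = [[False] * width for _ in range(height)]
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            queue, members = deque([(y0, x0)]), []
            seen[y0][x0] = True
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                members.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < height and 0 <= nx < width and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            for y, x in members:
                size[y][x] = len(members)
    return size

def assign_values(masks):
    """Give each pixel the hypothesis whose plausible component containing it is largest."""
    sizes = {value: component_sizes(mask) for value, mask in masks.items()}
    any_mask = next(iter(masks.values()))
    height, width = len(any_mask), len(any_mask[0])
    result = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best = max(sizes, key=lambda v: sizes[v][y][x])
            if sizes[best][y][x] > 0:  # None is kept when no hypothesis is plausible
                result[y][x] = best
    return result

masks = {
    0: [[True, True, False], [True, False, False]],
    1: [[False, True, True], [False, True, True]],
}
labels = assign_values(masks)
```

In this toy input the pixel at row 0, column 1 is plausible under both hypotheses; it receives value 1 because that component has four pixels while the competing one has three.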

2.3  Efficiency 

The computation of the plausible component hypotheses can be done in linear time, in a single pass over the images. The methods we have used for comparing component hypotheses, and for assigning values to pixels, are similarly efficient. In practice, our initial implementation takes a few seconds per image to compute depth. For example, the stereo images shown in section 3 took 3 seconds to process with 10 disparities, on a 50-MHz SPARC.

3  Experimental  results 

We have used our methods both for image restoration and for stereo and motion. For image restoration, we took a synthetic image of a set of inscribed diamonds and corrupted it with Gaussian noise with $\sigma = 6$. The original intensities are arranged with gaps of 40 intensities between adjacent regions. The restored image is shown in figure 1. 98% of the pixels were assigned the correct intensity by our method. For this example we used $\epsilon = 18$.

Figure 1: Image restoration of a synthetic image

For the stereo experiments presented below we used $\epsilon = 2$ to account mostly for the camera noise. To handle other measurement errors we introduced gain and bias parameters $g, b$ that adjust pixel intensities in the right image. Our model of image formation becomes

$$I(P) = (g \cdot I'(P + d) + b) + \nu(P). \qquad (6)$$

Connected components were computed for all disparities in $D$ and for all values of adjustment factors in a fixed range. For all real images shown here, we varied $g$ from 0.8 to 1.2 and $b$ from -18 to 18. Finally, for each pixel $P$ we chose disparity, $g$, and $b$ corresponding to the largest connected component containing $P$. This is a simple way to deal with $g$ and $b$ parameters.
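As a small illustration of how the gain and bias enter the threshold test, the sketch below evaluates the residual of model (6) for a grid of $(g, b)$ pairs at a single pixel. The step sizes used to sample the stated ranges and the function names are assumptions for this example; the text above only gives the ranges themselves.

```python
def residual(left_value, right_value, g, b):
    """Residual nu(P) = I(P) - (g * I'(P + d) + b) under the image formation model (6)."""
    return left_value - (g * right_value + b)

def plausible_adjustments(left_value, right_value, gains, biases, eps):
    """List the (g, b) pairs whose residual passes the threshold test."""
    return [(g, b) for g in gains for b in biases
            if abs(residual(left_value, right_value, g, b)) < eps]

gains = [0.8, 0.9, 1.0, 1.1, 1.2]   # assumed sampling of the range 0.8 to 1.2
biases = [-18, -9, 0, 9, 18]        # assumed sampling of the range -18 to 18
pairs = plausible_adjustments(119, 100, gains, biases, eps=2)
```

Several $(g, b)$ pairs can pass the test at a single pixel, which is why the choice among them is deferred to the size of the connected component each pair produces.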

There are usually a number of pixels in the left image which are occluded in the right image, and the attempt to find a match for such pixels will yield wrong answers. There is an easy way to cut down the number of such errors that our algorithm produces. Suppose that for some pixel $P$ in the left image, the largest connected component that $P$ is consistent with is smaller than some fixed minimal size. Then we consider pixel $P$ to have no match in the right image. We fixed the minimal component size to be 6 for all the results shown in this section.

We also check the output disparity for double assignments. That is, if two pixels in the left image are mapped to the same pixel in the right image, then the pixel that belongs to the bigger connected component gets assigned to that pixel in the right image.
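The two post-processing checks (minimum component size and double assignments) can be sketched together for a single scanline. The input format, the function name `resolve_matches`, and the decision to leave the losing pixel unmatched are assumptions made for this illustration; the text above does not say what happens to the pixel that loses a double assignment.

```python
def resolve_matches(candidates, min_size):
    """candidates: list of (left_x, disparity, component_size) for one scanline.

    Pixels whose component is smaller than min_size are treated as unmatched.
    When two left pixels map to the same right pixel, the one from the larger
    component keeps the match (the other is left unmatched here, by assumption).
    """
    winners = {}  # right_x -> (left_x, disparity, component_size)
    for left_x, disparity, size in candidates:
        if size < min_size:
            continue  # considered to have no match in the right image
        right_x = left_x + disparity
        current = winners.get(right_x)
        if current is None or size > current[2]:
            winners[right_x] = (left_x, disparity, size)
    return {left_x: disparity for left_x, disparity, _ in winners.values()}

candidates = [(0, 2, 40), (1, 1, 90), (2, 0, 4), (3, 1, 25)]
matches = resolve_matches(candidates, min_size=6)
```

Here left pixels 0 and 1 both map to right pixel 2, so pixel 1 (component size 90) keeps the match; pixel 2 is dropped because its component has fewer than 6 pixels.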
Figure 2 shows the left image of a random dot stereogram. In the right image, the entire image is translated uniformly. Note the large textureless region in the center, which creates problems for existing algorithms. Our method performs correctly for this image, assigning all pixels the correct translation. Our adaptive window scheme ensures that the entire image is treated as a single window, thus propagating information from the high-texture regions at the outskirts of the image into the low-texture regions in the middle of the image. Kanade and Okutomi's algorithm [Kanade and Okutomi, 1994] may also succeed in this situation.


Figure 2: Random dot stereogram illustrating aperture problem


3.1  Results  on  Real  Images 

Figure 3 shows the left image of the meter pair from CMU and the results of normalized correlation and our algorithm. In this image on the wall of the building there are many areas of low intensity variation. Normalized correlation clearly has trouble producing correct answers in these areas. Even if we use a large window of 20 × 20 pixels, there are still a lot of large bright spots on the wall, which were obviously not matched correctly. Our algorithm places most of the wall at the same disparity. We don't know if it is indeed the correct answer, but normalized correlation not only places most pixels of the wall at this disparity, but it also places large patches of pixels at this particular disparity at the right and the left ends of the wall.

It's interesting to observe how the algorithm works for the car. Almost all of the car, except for very few pixels, was placed at the same disparity. The edge of the car produced by our algorithm is very sharp.


Figure 3: Meter results (panels: Image; Our results)
Abstract

Organisms found in nature exhibit highly adaptable, robust visual abilities, hinting that biologically inspired methods such as artificial evolution are a potentially fruitful path toward inducing such abilities in artificial systems. However, the performance of search in any given domain is often highly dependent on the amount of knowledge of the particular domain that is applied to guide search. This fact poses multiple problems for artificial evolution in visual task domains. At the same time, natural evolution has succeeded despite these obstacles, suggesting that it may be possible to obtain useful knowledge of the domain in question over the course of evolution. The experiments reported here, conducted in the domain of artificial neural networks for face recognition, demonstrate that useful domain knowledge generated as a by-product of evolution may be extracted, analyzed, and used to improve further search in the domain.

1  Introduction 

Search algorithms patterned after biological evolution are attractive for use in domains such as vision that have complex search spaces for a number of reasons, including: (1) applying them doesn't explicitly require deep insight into the domain; (2) they're relatively straightforward to parallelize; and (3) their natural analog has resulted in entities of extraordinary complexity and robustness. However, the performance of search in any particular domain is highly dependent on the interaction between the chosen representation of the space and the specific search operators employed, and for evolutionary algorithms in particular, this interaction is a poorly understood process, leaving practitioners with few guidelines as to how to make the right choices to yield good performance.

One popular approach to improving the performance of search in a particular domain is to seek to incorporate pre-existing knowledge of the domain into the operators and representation. However, this approach is problematic for evolutionary search because of the aforementioned opacity of the interaction between the operators and the representation. This difficulty, popularly known as "the representation problem", is only compounded in more complex domains, presenting a formidable obstacle to the application of artificial evolution in precisely those domains in which it may be of the greatest utility.

Moreover, even if knowledge useful for specific visual tasks of interest is available, "hard-coding" any available domain knowledge into artificial evolution may actually inhibit adaptation when domain parameters change. Therefore, rather than seeking how pre-existing domain knowledge can be best exploited by evolution, our research is directed toward the automatic acquisition of such knowledge in operational form. We have developed domain-independent techniques for learning domain-adapted control knowledge that guides evolutionary search.

The experiments reported herein demonstrate that generalized information about a particular domain, generated over the course of evolutionary search, can be extracted, analyzed, and then employed to improve search in future runs. Our application, automated face recognition, is an instance of the general signal-to-symbol mapping problem [Teller and Veloso, 1995]. In particular, the task is to identify which of five individuals is represented in a given visual image.

The space explored is the weight space of fixed-topology, feed-forward artificial neural networks (ANN's). Over the course of adaptation, weight vectors were collected along with their self-adapted, variable mutation rate (e.g. [Schwefel, 1995], [Fogel et al, 1991]). These data were then used to train another ANN to predict the appropriate mutation rate for a given weight vector for the face-recognition domain in general. Finally, the mutation rate prediction networks were used to drive evolution on another face recognition task, resulting in networks with improved generalization performance.

2  The  Face  Recognition  Domain 

The image set (based on data and code provided by Jeff Shufelt and Tom Mitchell, available at http://www.cs.cmu.edu/~tom/faces.html) consists of approximately 8 gray-scale images of each of 20 people. Each person is pictured facing forward, from the shoulders up, making a variety of facial expressions, both with sunglasses and without.

The images were scaled to 30x32 pixels of one byte each, yielding a total of 960 bytes which are then provided as input to a feed-forward artificial neural network with a single hidden layer and an output layer, each consisting of five sigmoid units.


Figure 1: Face recognition artificial neural network (example target output: 0.1 0.9 0.1 0.1 0.1)


We divided the image set of the 20 people into four task sets, $S_1$ - $S_4$, of five people each. Each task set was further divided into training and testing sets, with 5 images of each individual allocated to training, and 2 or usually 3 to testing. For each task set, a single output unit was designated as a "recognizer" for each person in the set. When presented with an image of a given person, the target output for the network was 0.9 from the recognizer output unit, and 0.1 from the rest (see figure 1).
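The network and target encoding described in this section can be sketched in plain Python. The sketch assumes a flat weight vector, one bias weight per unit, and pixel values rescaled to the range 0 to 1; none of these details are stated above, and all function names are illustrative.

```python
import math

N_INPUTS, N_HIDDEN, N_OUTPUTS = 960, 5, 5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, n_units):
    """Sigmoid layer; each unit owns len(inputs) weights followed by one bias weight (assumed)."""
    stride = len(inputs) + 1
    outputs = []
    for u in range(n_units):
        w = weights[u * stride:(u + 1) * stride]
        outputs.append(sigmoid(sum(wi * xi for wi, xi in zip(w, inputs)) + w[-1]))
    return outputs

def forward(weights, image):
    """Run the 960-5-5 network on one image given a flat weight vector."""
    split = N_HIDDEN * (N_INPUTS + 1)
    hidden = layer(image, weights[:split], N_HIDDEN)
    return layer(hidden, weights[split:], N_OUTPUTS)

def target_vector(person_index):
    """0.9 for the designated recognizer unit and 0.1 for the rest."""
    return [0.9 if k == person_index else 0.1 for k in range(N_OUTPUTS)]

def squared_error(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))

n_weights = N_HIDDEN * (N_INPUTS + 1) + N_OUTPUTS * (N_HIDDEN + 1)
weights = [0.0] * n_weights   # an all-zero weight vector
image = [0.5] * N_INPUTS      # placeholder pixel values, assumed scaled to [0, 1]
outputs = forward(weights, image)
error = squared_error(outputs, target_vector(1))
```

With all weights at zero every sigmoid unit outputs 0.5, so the squared error against any target vector is the same regardless of which person is depicted.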

3 Evolving Face Recognition Networks

Artificial evolution, typified by such paradigms as Evolutionary Programming ([Fogel et al, 1966], [Fogel, 1995]), Evolution Strategies ([Rechenberg, 1973], [Schwefel, 1995]), and Genetic Algorithms ([Holland, 1975], [Goldberg, 1989]), may be viewed as the alternating, iterative application of two processes to a population of candidate objects:

• the selection process transforms the population to one in which the most promising candidates tend to be represented in greater frequency, thus boosting the mean quality of the population

• the variation process then stochastically perturbs individuals in the population in some manner, with the hope of producing individuals of higher quality than any yet found.

The specific instantiation of artificial evolution employed here most closely matches that of the Evolution Strategies paradigm, in which the principal source of variation is Gaussian mutation, implemented with self-adapting, variable mutation rates.

3.1 Phase 1 - Collection of Domain Knowledge

In the first phase of our experiments, several runs of the evolutionary algorithm were conducted, each of which adapted face recognition networks (FRN's) to perform the recognition task for one of the task sets $S_1$, $S_2$, or $S_3$.

Each run iteratively transformed a population of 100 individual FRN's. Each FRN $i$ was represented as a vector of real values, with one value for each weight in the network, $w_{i1} \ldots w_{in}$, and one additional value for the mutation rate (see below), $m_i$. In the initial population, all the weight values were initialized to 0, and the mutation rates were set to 0.1.

3.1.1  The  Selection  Process 

The selection process presented each FRN with each of the training images from the target task set for the run in question as input, and assigned each network a fitness value equal to its sum of squared errors (SSE) with respect to each of the target output vectors (0.9 for the output unit assigned to recognize the person depicted in the input image, and 0.1 for the rest) for each image. A new population (with a lower mean SSE) was then made by copying each of the 10 FRN's with the lowest SSE on the training images 10 times. This form of selection is known as $(\mu, \lambda)$-selection, where $\mu$ (10 in this case) is the number of "parents" chosen to reproduce to form a new population of size $\lambda$ (100 in this case).
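The selection step can be sketched as a short Python function. The fitness function is passed in as a callable, and the toy population below (one-element vectors scored by distance from 3.0) is a stand-in chosen for this example rather than the SSE computation on face images.

```python
def select(population, fitness, mu=10, copies=10):
    """(mu, lambda)-selection: keep the mu lowest-error individuals and copy each `copies` times."""
    ranked = sorted(population, key=fitness)   # lower fitness value means lower error
    parents = ranked[:mu]
    return [list(parent) for parent in parents for _ in range(copies)]

# Toy stand-in for SSE: squared distance of a one-element "weight vector" from 3.0.
population = [[float(v)] for v in range(100)]
next_population = select(population, fitness=lambda ind: (ind[0] - 3.0) ** 2)
```

Each copy is a fresh list, so later mutation of one offspring does not alter its siblings or its parent.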

3.1.2  The  Variation  Process 

In the variation process, each value in each weight vector (each vector representing a FRN) was mutated using Gaussian noise. As is typical of the Evolution Strategies paradigm, the mutation rate values were each multiplied by a value sampled from a log-normal distribution, i.e. the exponential of a Gaussian with a mean of 0 and a standard deviation of 0.1. For each individual FRN, its resulting mutation rate was then used as the standard deviation of another Gaussian with a mean of 0. For each weight value of the FRN in question, a value was sampled from this distribution and added to the weight:

$m_i' = m_i \cdot \exp(\mathrm{Gaussian}(\mu = 0, \sigma = 0.1))$

$w_{ij}' = w_{ij} + \mathrm{Gaussian}(\mu = 0, \sigma = m_i')$, for $j = 1 \ldots n$
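The two update rules above translate directly into code. The following is a minimal sketch using only the Python standard library; the function name `mutate` and the fixed random seed are choices made for this example.

```python
import math
import random

def mutate(weights, rate, rng):
    """Return (new_weights, new_rate) for one offspring."""
    # The mutation rate is perturbed first, by a log-normal factor.
    new_rate = rate * math.exp(rng.gauss(0.0, 0.1))
    # The new rate is then used as the standard deviation of the weight noise.
    new_weights = [w + rng.gauss(0.0, new_rate) for w in weights]
    return new_weights, new_rate

rng = random.Random(0)  # fixed seed so the sketch is repeatable
child_weights, child_rate = mutate([0.0] * 5, rate=0.1, rng=rng)
```

Because the rate is multiplied by an exponential, it stays positive no matter how many generations of mutation it passes through.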

Evolutionary algorithms such as this one have previously been shown to produce good results for training ANN's (e.g. [Porto et al, 1995], [Baluja, 1995], [Glickman and Sycara, 1996]). The minimum SSE in the population over time for both the training and test sets (SSE was measured for the test set to observe the change of generalization performance over time, but only SSE on the training set was used for the purposes of the selection process*) is shown for a typical run in figure 2.

Training was stopped after 500 generations, both for purposes of expediency and because we determined that adaptation had typically slowed to a relatively insignificant rate by this point. Moreover, in the great majority of cases (including the run shown in figure 2), despite the fact that the SSE with respect to the target output vectors was still above zero, by 500 generations all of the images in both training and test sets were found to have long since been properly classified (with classification being defined as the designated output unit producing a value above 0.5, while the others produce values below 0.5).

*The SSE depicted for the test set at each generation is for the same FRN for which the SSE on the training set is shown, i.e. the one in the population with the lowest SSE on the training set. This FRN is thus not necessarily the one with strictly the lowest SSE on the test set in the population.

Figure 2: A typical run on task set $S_1$ (vertical axis: log population minimum SSE)
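The classification criterion (designated unit above 0.5, all others below 0.5) is easy to state in code, which also makes clear why an image can count as properly classified while its SSE is still above zero. The function name is illustrative.

```python
def is_correctly_classified(outputs, person_index, threshold=0.5):
    """True when the designated unit is above the threshold and every other unit is below it."""
    for k, value in enumerate(outputs):
        if k == person_index:
            if not value > threshold:
                return False
        elif not value < threshold:
            return False
    return True

# Far from the 0.9 / 0.1 targets, yet still a correct classification for person 1.
example = [0.2, 0.7, 0.3, 0.1, 0.4]
ok = is_correctly_classified(example, person_index=1)
```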

3.1.3 Self-adapting mutation rates

As stated above, the mutation rates for each FRN are themselves subject to mutation. Despite the fact that the mutation rates are not directly measured by the fitness function, selection results in their adaptation as well, via an indirect mechanism called self-adaptation in the Evolutionary Computation literature.

A FRN that is selected for the next generation has 10 "offspring". Each of its offspring inherits its parent's mutation rate, which is then mutated, and then used to mutate its weights. Selection is thus seen to work on mutation rates via favoring those individuals with mutation rates that resulted in a favorable mutation to their own weights. Given two different networks, the "best" mutation rate for one network (the one most likely to produce a favorable mutation in it) may not be the same as the "best" mutation rate for the other. Selection will favor them depending on how well their mutation rates are adapted to their own specific position in weight space, hence the "self" in self-adaptation.


Figure 3: Adaptation of mutation rates for a typical run on task set $S_1$


At the same time, it is a general rule that the better adapted a particular individual is (in this case, the lower the SSE on the training set), the more likely mutation is to degrade its fitness, and the higher the mutation rate, the greater the expected extent of degradation. Thus, increasingly lower mutation rates tend to be favored overall as adaptation proceeds (although not exclusively), as is shown in figure 3 (for the same sample run as in figure 2).

Figure 3 also illustrates two other important points. One is that selection on mutation rates is noisier than that on SSE. This is to be expected because the rates themselves are not directly measured by selection, and because even a poor rate has a chance of producing good offspring while a good rate can produce poor offspring. Another observation to make is that while the general tendency is for mutation rates to decrease, this tendency is not necessarily monotonic. The average population mutation rate in figure 3 is actually seen to rise for a period lasting from approximately generations 200 - 375. During this same period, figure 2 shows that the population minimum SSE on the training images dropped significantly after a period of relative stagnation. One may therefore infer that during this period the networks in the population discovered a region of the weight space in which higher mutation rates enabled them to adapt more rapidly. The purpose of noting this occurrence is to draw attention to the fact that while the overall trend is for mutation rates to decay over time (e.g. see the line depicting self-adapting mutation rates averaged over several runs in figure 5 below), at any particular point in any particular run, self-adapting mutation rates are sensitive to the local contours of the fitness surface.

Thus, the mutation rates appear as clues to a sort of noisy map of the fitness surface. However, because the rates do decay as a local optimum is approached, the information isn't usually retained at the end of the run. To see whether generalized domain information might be extracted from these traces, we collected a set, $S$, of the best weight vectors from each generation along with their associated mutation rates over several runs on each of task sets $S_1$, $S_2$, and $S_3$. In the next phase of these experiments, we attempted to use these data to learn to predict, given a weight-vector representing a FRN, a mutation rate that would be suitable for mutating this vector.

3.2 Phase 2 - Analysis of Domain Knowledge

In the second phase, we again trained ANN's. This time, however, the networks trained were mutation rate prediction networks (MRPN's), mapping FRN weight-vectors to appropriate mutation rates. The MRPN's were assigned three hidden units, and trained using backpropagation for speed. It is important to point out that our goal was simply to determine whether these data were in fact learnable to any useful extent*.

After periodic sampling to reduce the volume of training examples to a reasonable number, we allocated approximately two-thirds of the dataset $S$ collected in phase 1 to training, and the remaining one-third to testing, taking care that the data allocated to training and testing were from separate runs in order to avoid overfitting particular runs. The weight vectors were then presented as input and their associated mutation rates were used as the target output. We stopped training after overfitting began to occur, and saved the MRPN's with the lowest error on the test data for use in the third phase of the experiment.

*The choice of ANN's for this phase was essentially arbitrary; any one of a host of other function approximators may quite likely yield better performance for the mutation rate prediction task.

After training, the MRPN's had learned to predict, for a given FRN weight vector, a mutation rate at which it would be advantageous to mutate the given vector in order to achieve adaptation. Because the MRPN's were trained from data collected from a variety of runs conducted on multiple different task sets, to the extent to which an MRPN was successful at guiding search on another face recognition task set on which it hadn't been trained, it could be said to embody some form of knowledge about the FRN domain in general. In phase 3, we conducted experiments to determine whether this was in fact the case.

3.3 Phase 3 - Use of Domain Knowledge

In the final phase, face recognition networks were once again evolved, but this time using task set $S_4$. This time, instead of mutating the selected weight vectors with an associated mutation rate, each selected vector was provided as input to one of the MRPN's trained in phase 2. The resulting mutation rate was then used to mutate the connection weights:

$m_i = \mathrm{MRPN}(w_i)$

$w_{ij}' = w_{ij} + \mathrm{Gaussian}(\mu = 0, \sigma = m_i)$, for $j = 1 \ldots n$
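The change from phase 1 is only in where the mutation rate comes from, as the sketch below shows. The predictor here is a hypothetical stand-in (a simple function of the weight norm) used so that the example runs; it is not a trained MRPN and says nothing about what the actual networks learned.

```python
import random

def mutate_with_predictor(weights, predict_rate, rng):
    """Mutate a weight vector using a rate supplied by a predictor instead of an inherited rate."""
    rate = predict_rate(weights)
    return [w + rng.gauss(0.0, rate) for w in weights], rate

def toy_predictor(weights):
    """Hypothetical stand-in for a trained MRPN: smaller rate for a larger weight norm."""
    norm = sum(w * w for w in weights) ** 0.5
    return 0.1 / (1.0 + norm)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
parent = [0.0, 3.0, 4.0]
child, rate = mutate_with_predictor(parent, toy_predictor, rng)
```

Note that the rate is recomputed from the weights on every call, so nothing about the rate is inherited or mutated between generations.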

The mutation rates in these runs were therefore no longer being self-adapted, but rather determined by the MRPN's learned in phase 2. For the purposes of comparison, a set of runs with self-adapting rates, analogous to those conducted in phase 1, were also conducted on $S_4$.

4  Results 

Figure 4 shows the log minimum population SSE on the task $S_4$ test set, averaged by generation, over the first five runs achieved using two separate MRPN's trained in phase 2 to determine the mutation rates. Average results for runs using self-adapting mutation rates conducted on task $S_4$, analogous to those conducted in phase 1, are shown for comparison. Use of the MRPN's over these initial runs showed a non-trivial improvement in generalization performance on the face recognition task as measured by performance on the test images. Some measure of insight as to the source of this improvement may be gleaned from figure 5.

Figure 4: Average log population minimum SSE on the task $S_4$ test set over time

Figure 5: Average population mutation rate on task $S_4$ over time

Most significantly, note that the population average mutation rate is indeed attenuated as adaptation proceeds, signifying that the MRPN's are indeed managing to recognize FRN vectors that are more adapted, and specify a lower mutation rate as is appropriate. Moreover, because the MRPN's were trained on data collected from runs adapting FRN's on task sets other than the ones examined here, their mutation rate "recommendations" must be based on generalized FRN domain knowledge, rather than specific knowledge of task set $S_4$.

Despite the fact that some domain knowledge
has apparently been acquired, figure 5 does
show a substantial difference between the
average rates specified by the MRPN’s and those
found via self-adaptation. If the self-adapted
rates are in fact well-adapted, why do the
differing rates specified by the MRPN’s result in
an advantage (i.e. lower SSE) with respect to
generalization performance (figure 4)? The
answer is unknown at this time; however, one
hypothesis is that a run employing self-adapting
rates spends a significant amount of its
search capacity per generation exploring the
space of mutation rates. In other words, with
self-adapting rates, FRN’s are continually being
spawned with mutation rates that are either too
high or too low to result in a reasonable rate
of adaptation. In runs employing the MRPN’s,
mutation rates are determined in a much more
deterministic manner, focusing adaptation on
the FRN’s rather than on their mutation rates
themselves.
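
The contrast drawn in this hypothesis can be made concrete with a
minimal sketch. Everything in it is an assumption made for
illustration: the log-normal rate update is the scheme commonly used
in evolution strategies and is not confirmed by the text as the
self-adaptation rule used here, and `toy_predictor` is a hypothetical
stand-in for a trained MRPN, not the network itself.

```python
import math
import random

def mutate_self_adaptive(weights, rate, tau=0.2, rng=random):
    """Self-adaptation (assumed log-normal scheme): the rate is itself
    mutated, then used as the std. dev. of the weight perturbation."""
    new_rate = rate * math.exp(tau * rng.gauss(0.0, 1.0))
    child = [w + new_rate * rng.gauss(0.0, 1.0) for w in weights]
    return child, new_rate

def mutate_with_predictor(weights, predict_rate, rng=random):
    """Predictor-driven: the rate is a deterministic function of the
    parent's weight vector (the role the text assigns to an MRPN)."""
    rate = predict_rate(weights)
    child = [w + rate * rng.gauss(0.0, 1.0) for w in weights]
    return child, rate

def toy_predictor(weights, initial=None, base=0.5):
    """Hypothetical predictor: the rate shrinks as the vector moves
    away from its initial state (a simple 'cooling' behaviour)."""
    initial = initial if initial is not None else [0.0] * len(weights)
    dist = math.sqrt(sum((w - w0) ** 2 for w, w0 in zip(weights, initial)))
    return base / (1.0 + dist)
```

In the first function every offspring carries a freshly perturbed
rate, so part of each generation is spent sampling rates; in the
second, two offspring of the same parent always receive the same rate.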

5  Conclusion 

Many  related  questions  remain  to  be  answered. 
In  particular,  the  question  of  how  to  best  ap¬ 
proximate  the  function  mapping  weight  vectors 
to  mutation  rates  for  this  domain  remains  wide 
open. ANN’s and backpropagation were used in
this  case,  but  we  know  of  no  reason  to  expect 
them  to  perform  better  than  any  other  func¬ 
tion  approximator.  Moreover,  MRPN’s  trained 
on  differing  samples  of  the  phase  1  data  differ 
in  their  ability  to  control  subsequent  searches. 
There  clearly  remains  a  great  deal  of  exploration 
to  do  with  respect  to  both  how  to  sample  the 
data  collected  in  phase  1  to  yield  the  most  co¬ 
herent  training  signal,  as  well  as  how  domain 
knowledge  can  be  best  generalized  from  this  sig¬ 
nal.  Perhaps  most  interesting  is  the  question 
of  what  sort  of  knowledge  has  actually  been 
learned.  Have  the  MRPN’s  learned  some  sort  of 
adaptive mutational “cooling” schedule,
specified by the distance the FRN vectors have
deviated from their initialized state? How general
is  the  learned  control  knowledge?  How  useful 
is  it  in  guiding  search  for  effective  face  recogni¬ 
tion  in  similar  problem  classes?  These  questions 
are  among  those  we  are  addressing  in  ongoing 
research. 

At  the  same  time,  these  preliminary  results 
demonstrate  that  it  is  indeed  feasible  to  auto¬ 
matically  adapt  evolutionary  search  to  improve 
performance  in  a  particular  domain.  More¬ 
over,  adaptation  to  the  domain  is  accomplished 
via  analysis  of  a  training  signal  produced  as  a 
byproduct  of  evolution  itself,  suggesting  that 
this  entire  process  has  the  potential  to  be  car¬ 
ried  out  by  evolution  directly,  without  the  need 
for explicit “phases” (such as employed above).
These  findings  indicate  a  promising  direction  for 
artificial  evolution  in  complex  domains  such  as 
image  understanding. 

Abstract

Recognizing landmarks is a critical task for mobile
robots.  Landmarks  are  used  for  robot  positioning, 
and  for  building  maps  of  unknown  environments. 
In  this  context,  the  traditional  recognition  tech¬ 
niques  based  on  strong  geometric  models  cannot 
be  used.  Rather,  models  of  landmarks  must  be  built 
from  observations  using  image-based  visual  learn¬ 
ing  techniques.  Beyond  its  application  to  mobile 
robot  navigation,  this  approach  addresses  the  more 
general  problem  of  identifying  groups  of  images 
with  common  attributes  in  sequences  of  images. 
We  show  that,  with  the  appropriate  domain  con¬ 
straints  and  image  descriptions,  this  can  be  done 
using  efficient  algorithms  as  follows:  Starting  with 
a  “training”  sequence  of  images,  we  identify 
groups  of  images  corresponding  to  distinctive  land¬ 
marks.  Each  group  is  described  by  a  set  of  feature 
distributions. At run-time, the observed images are
compared  with  the  sets  of  models  in  order  to  recog¬ 
nize  the  landmarks  in  the  input  stream. 

1.  Introduction 

Recognizing  landmarks  in  sequences  of  images  is  a 
challenging  problem  for  a  number  of  reasons.  First 
of  all,  the  appearance  of  any  given  landmark  varies 
substantially from one observation to the next. In
addition to variation due to different aspects,
illumination change, external clutter, and changing
geometry  of  the  imaging  devices  are  other  factors 
affecting  the  variability  of  the  observed  landmarks. 
Finally,  it  is  typically  difficult  to  use  accurate  3D 
information  in  landmark  recognition  applications. 
For  those  reasons,  it  is  not  possible  to  use  many  of 
the  object  recognition  techniques  based  on  strong 
geometric  models. 

The  alternative  is  to  use  image-based  techniques  in 
which landmarks are represented by collections of
images  which  are  supposed  to  capture  the  “typical” 
appearance  of  the  object.  The  information  most  rel¬ 
evant  to  recognition  is  extracted  from  the  collection 
of  raw  images  and  used  as  the  model  for  recogni¬ 
tion.  This  process  is  often  referred  to  as  “visual 
learning”. 

Progress  has  been  made  recently  in  developing 
such  approaches.  For  example,  in  object 
modeling [Gros et al.], 2D or 3D models of objects
are  built  for  recognition  applications.  An  object 
model  is  built  by  extracting  features  from  a  collec¬ 
tion  of  observations.  The  most  significant  features 
are  extracted  for  the  entire  set  and  are  used  in  the 
model representation. Extensions to generic object
recognition  were  presented  recently  [Carlsson, 
1996]. 

Other  recent  approaches  use  the  images  directly  to 
extract  a  small  set  of  characteristic  images  of  the 
objects  which  are  compared  with  observed  views  at 
recognition time. For example, eigen-image
techniques  are  based  on  this  idea. 

Those  approaches  are  typically  used  for  building 
models of single objects observed in isolation. In
the case of landmark recognition for navigation,
there is no practical way to isolate the object in
order  to  build  models.  Worse,  it  is  often  not  known 
in  advance  which  of  the  objects  observed  in  the 
environment  would  constitute  good  landmarks. 

A  similar  problem,  although  in  a  different  context, 
is  encountered  in  image  indexing,  where  the  main 
problem  is  to  store  and  organize  images  to  facili¬ 
tate  their  retrieval  [Lamiroy  et  al.,  1996]  [Schmid 
et  al.,  1996].  The  emphasis  in  this  case  is  the  kind 
of  features  used  and  the  type  of  requests  that  can  be 
made  by  the  user. 

Our  approach  tries  to  combine  these  two  catego¬ 
ries  of  systems.  In  a  training  stage,  the  system  is 
given  a  set  of  images  in  sequence.  The  aim  of  the 
training  is  to  organize  these  images  into  groups 
based  on  similarity  of  feature  distributions  between 
images.  The  size  of  the  groups  obtained  may  be 
defined  by  the  user,  or  by  the  system  itself.  In  the 
latter case, the system tries to find the most
relevant groups, taking the global distribution of the
images  into  account.  In  a  second  step,  the  system  is 
given  new  images,  which  it  tries  to  classify  as  one 
of  the  learned  groups,  or  in  the  category  of  unrec¬ 
ognized  images. 

The  basic  representation  is  based  on  distributions 
of  different  feature  characteristics.  All  these  differ¬ 
ent  kinds  of  histograms  are  computed  for  the  whole 
image  and  for  a  set  of  sub-images.  Tests  are  used  to 
compare  these  histograms  and  define  a  distance 
between  images.  This  distance  is  then  used  to  clus¬ 
ter  the  images  into  groups.  Each  group  is  then  char¬ 
acterized  by  a  set  of  feature  histograms.  When  new 
images  are  given  to  the  system,  the  algorithm  eval¬ 
uates  a  distance  between  these  images  and  the 
groups.  The  system  determines  to  which  group  this 
image  is  the  closest,  and  a  set  of  thresholds  is  used 
to  decide  if  the  image  belongs  to  this  group. 

The  main  goal  of  the  work  presented  here  was  to 
explore the application of tools and methods from
the field of image retrieval to the problem of
landmark recognition. It is clear that the global architecture of the
system  is  close  to  that  of  object  recognition 
systems  [Gros  et  al.]:  a  training  stage  in  which  3D 
shape,  2D  aspects,  or  groups,  are  characterized  is 
followed  by  a  recognition  stage  in  which  this  infor¬ 
mation  is  used  to  recognize  the  models,  objects  or 
groups  in  new  images.  The  difference  comes  from 
the wide diversity of the images and from the
groups which are not reduced to a single aspect of
an  object.  The  two  challenging  tasks  which  we 
concentrate  on  in  the  remainder  of  the  paper  are  to 
define  these  groups  more  precisely  as  sets  of 
images,  and  to  build  automatically  a  characteriza¬ 
tion  for  each  group. 

The  first  section  of  the  paper  deals  with  the  feature 
distributions,  their  computation  and  their  compari¬ 
son.  The  second  section  addresses  the  computation 
and  the  characterization  of  groups.  The  third  sec¬ 
tion  concerns  the  classification  of  new  images 
according  to  the  groups  previously  defined.  Experi¬ 
mental  data  and  results  are  presented  in  the  fourth 
section. 


Figure  1:  Sample  set  of  images  from  a  typical 
training  sequence.  The  complete  training 
sequence  contains  176  images  taken  over 
approximately  800  meters. 

2.  Representing  Images 

In  this  section,  we  give  a  brief  overview  of  the  fea¬ 
tures  used  for  representing  images.  Since  an  indi¬ 
vidual  feature  can  be  characteristic  of  an  aspect  of 
an  object,  but  probably  fails  to  characterize  well  a 
set  of  aspects,  we  use  a  statistical  description  of  a 
large number of features as our basic representation.

Representing Feature Distributions

Five  basic  features  are  currently  used:  distribu¬ 
tions  of  normalized  color  and  intensity,  edges,  seg¬ 
ments,  and  parallel  segments.  Additional  features 
can  be  added  as  needed.  The  basic  image  represen¬ 
tation  is  a  set  of  feature  distributions.  Edges  are 
computed  using  Deriche’s  edge  detector  [Deriche, 
1987].  Segments  are  computed  as  a  polygonal 
approximation  of  the  edges  [Horaud  et  al.,  1990]. 


Among  the  characteristics  computed  from  these 
five  features  are:  color  and  grey  levels;  edge  den¬ 
sity,  orientation,  length  and  position;  segment  den¬ 
sity,  orientation,  length,  and  position;  parallel 
segment  density,  orientation,  length,  and  position, 
and,  finally,  the  angles  between  adjacent  segments. 

Feature  densities  are  computed  in  two  ways:  first 
as  the  ratio  of  the  number  of  features  (edges,  seg¬ 
ments  and  parallel  segments,  respectively)  to  the 
area  of  the  image  or  sub-image  concerned;  second 
as  the  ratio  of  the  sum  of  all  the  feature  lengths  to 
the  area  of  the  image  or  sub-image.  All  the  other 
measurements  are  computed  and  the  results  stored 
in histograms. Each bucket of the length
histograms indicates how many features have a length in
a given interval.

The  position  of  a  particular  object  in  an  image  may 
vary  substantially  between  observations.  There¬ 
fore,  it  is  important  to  build  the  representation  in  a 
way  that  allows  for  different  placements  of  the 
object  with  similar  resulting  feature  distributions. 
This  is  done  by  subdividing  the  image  into  smaller 
chunks,  in  which  the  feature  distributions  are  com¬ 
puted.  All  these  histograms  and  densities,  except 
those  relative  to  feature  position,  are  computed  for 
the whole image and for the sub-images obtained
by dividing the image into 4, then 9, and then 16 parts.
The position histograms are computed only for the
global image. In total, 90 densities and 333 histograms
are computed to characterize each image.
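
The subdivision scheme can be sketched as follows. This is only an
illustration under stated assumptions: it handles a single feature
(grey-level intensity), the 16-bin histogram size is hypothetical, and
the function names `grid_regions` and `intensity_histograms` are not
taken from the paper.

```python
import numpy as np

def grid_regions(height, width, grids=(1, 2, 3, 4)):
    """Yield (row_slice, col_slice) for the whole image and for its
    2x2, 3x3 and 4x4 subdivisions: 1 + 4 + 9 + 16 = 30 regions."""
    for g in grids:
        rows = np.linspace(0, height, g + 1).astype(int)
        cols = np.linspace(0, width, g + 1).astype(int)
        for r in range(g):
            for c in range(g):
                yield slice(rows[r], rows[r + 1]), slice(cols[c], cols[c + 1])

def intensity_histograms(image, bins=16):
    """Return one normalized grey-level histogram per region."""
    hists = []
    for rs, cs in grid_regions(*image.shape):
        h, _ = np.histogram(image[rs, cs], bins=bins, range=(0, 256))
        hists.append(h / max(h.sum(), 1))  # guard against empty regions
    return hists
```

Because each region is described by a distribution rather than by
pixel positions, an object that shifts inside a region leaves that
region's histogram largely unchanged.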

2.1.  Comparing  Feature  Distributions 

The  feature  distributions  from  two  images  are 
compared  using  a  distance  similar  to  the  chi-square 
distance.  This  distance,  in  its  simplest  form,  evalu¬ 
ates  the  probability  that  two  sets  of  data,  here  the 
histograms  or  the  densities,  are  derived  from  the 
same theoretical distribution. If the distributions
are $h = (h_i)$ and $l = (l_i)$, their difference is computed
as [Press et al., 1992]: $d(h, l) = \sum_i (h_i - l_i)^2 / (h_i + l_i)$.
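
A minimal sketch of such a distance, assuming the two-histogram
chi-square form given in [Press et al., 1992] (the sum over bins of
the squared difference divided by the sum of the two bin values). The
function name is illustrative, and skipping bins that are empty in
both histograms is an assumption that follows the usual convention.

```python
import numpy as np

def chi_square_distance(h, l):
    """Chi-square-like distance between two histograms of equal length."""
    h = np.asarray(h, dtype=float)
    l = np.asarray(l, dtype=float)
    total = h + l
    mask = total > 0  # bins empty in both histograms contribute nothing
    return float(np.sum((h[mask] - l[mask]) ** 2 / total[mask]))
```

The distance is symmetric in its two arguments and is zero for
identical histograms.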

The  main  problem  is  to  derive  a  global  distance 
between  two  images  from  the  individual  distances 
computed  for  each  type  of  density  and  histogram. 
The distance $d_{ij}$ between two images i and j is
defined as the linear combination of the distances
between individual feature distributions:
$d_{ij} = \sum_k w_k \, d(h_k^i, h_k^j)$, where $h_k^i$ is the k-th
feature distribution of image i, $w_k$ is its weight, and
the sum is taken over all the feature
distributions used for representing the images.

When nothing is known about the distributions
and their range of variation, all the weights can be
taken equal to one. This simple approach gives
good results in practice, but a better approach is to
compute the weights based on the relative scales of
the feature distributions. For each kind of density
or histogram, the distance between every pair of
images is computed, and the variance $\sigma_k^2$ of these
distances is derived. The coefficient relative to this
particular distribution can be chosen as $w_k = 1/\sigma_k^2$.
This choice of weights has the effect of normalizing
the distributions and of assigning the same
relative importance to all the partial distances used.
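
The weighting step can be sketched as below. The array layout
`partial[k, i, j]` (distance between images i and j for feature
distribution k) and both function names are assumptions made for this
example. The weight is taken as the inverse of the variance of the
pairwise distances, which is one reading of the text; the inverse
standard deviation would be an equally simple alternative.

```python
import numpy as np

def inverse_variance_weights(partial):
    """Return one weight per feature distribution: 1 / variance of its
    pairwise distances over all unordered image pairs."""
    partial = np.asarray(partial, dtype=float)
    n = partial.shape[1]
    iu = np.triu_indices(n, k=1)  # each unordered pair (i < j) once
    variances = partial[:, iu[0], iu[1]].var(axis=1)
    return 1.0 / np.maximum(variances, 1e-12)  # guard constant features

def global_distance_matrix(partial, weights=None):
    """Weighted sum of the partial distance matrices; unit weights
    reproduce the simple 'all weights equal to one' case."""
    partial = np.asarray(partial, dtype=float)
    if weights is None:
        weights = np.ones(partial.shape[0])
    return np.tensordot(weights, partial, axes=1)
```

With these weights, a feature whose distances happen to be numerically
large no longer dominates the global distance.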

3.  From  Images  to  Landmarks 

Given a sequence of images, we now want to extract
distinctive landmarks, that is, we want to split the
sequence  into  groups  of  images,  and  find  a  charac¬ 
terization  of  each  of  these  groups  which  allows  fur¬ 
ther  classification. 

This  step  is  difficult  to  do  fully  automatically  in 
general.  The  main  reason  is  that  there  is  not  a  task- 
independent  definition  of  the  type  of  image  groups 
that  are  needed.  Our  approach  is  to  use  task  con¬ 
straints  to  guide  the  grouping  process.  Specifically, 
given  an  initial  grouping  of  images,  we  select 
groups  based  on  three  constraints.  First,  only  the 
groups  that  contain  a  large  enough  number  of 
images  from  different  aspects  are  retained.  Second, 
groups  that  do  not  provide  significant  discrimina¬ 
tion  are  discarded.  This  is  important  to  ensure  that, 
at  recognition  time,  only  the  groups  that  can  be 
easily  distinguished  are  used  as  models.  Finally, 
the  recorded  sensor  position  for  each  training 
image  is  used  for  ensuring  that  the  groups  are  spa¬ 
tially  coherent. 

3.1.  Computing  Initial  Image  Groups 

Once  the  distance  matrix  is  computed,  a  simple 
agglomerative  method  is  used  to  split  the  image  set 
into  initial  groups.  First  each  image  is  put  in  a  dif¬ 
ferent  group.  Then  the  two  closest  groups  are 
grouped  and  the  distance  matrix  is  updated. 
Finally,  the  algorithm  iterates  the  previous  step 
until an ending condition is satisfied.

Let |L| denote the number of elements of the image
group L. The update of the matrix consists of
deleting the two rows and two columns i and j
corresponding to the groups I and J which are
merged, and then adding a new row and column
for the new group formed. The diagonal term of the
new row added is:

$$d_{I \cup J,\, I \cup J} = \frac{|I|(|I|-1)\, d_{ii} + |J|(|J|-1)\, d_{jj} + 2|I||J|\, d_{ij}}{|I|(|I|-1) + |J|(|J|-1) + 2|I||J|} \qquad (1)$$

The non-diagonal term corresponding to the new
group and a group k is:

$$d_{I \cup J,\, k} = \frac{|I|}{|I \cup J|}\, d_{ik} + \frac{|J|}{|I \cup J|}\, d_{jk} \qquad (2)$$

These formulas show that, at each iteration, the
only information needed is the distance matrix and
the number of elements in each group. Each term
of the matrix is thus the mean distance between the
images of two groups, or the mean distance of the
images of the same group.


3.2. Controlling the Grouping Algorithm


The grouping algorithm described above is general.
In particular, it does not incorporate a control
structure that stops the grouping process when
groups of images corresponding to recognizable
landmarks are formed. An automatic method was
developed for controlling image grouping.


Given a set of image groups, let us denote the
distance matrix by $(d_{ij})$ and the number of images of
the group i by $n_i$. If the images used to learn the
groups form a representative sample, and if they
are spread nearly uniformly in representation
space, the probability that an unknown image will
be classified in the group i ($p_i$) or in no group at all
($p_0$) can be evaluated. If n denotes the number of
groups:

$$p_i = \frac{d_{ii}}{\sum_j d_{jj} + \frac{2}{n-1} \sum_{j \neq k} d_{jk}}, \qquad p_0 = \frac{\frac{2}{n-1} \sum_{j \neq k} d_{jk}}{\sum_j d_{jj} + \frac{2}{n-1} \sum_{j \neq k} d_{jk}} \qquad (3)$$

These formulas state that the probability $p_i$ that a
new image belongs to a group is proportional to the
size $d_{ii}$ of the group, and that the probability $p_0$ of
being in no group is proportional to the distances
$d_{jk}$ between the groups. The factor 2/(n-1) is used
to compensate the number of non-diagonal terms
of the distance matrix with respect to the number
of diagonal terms.

In the case where the images are known not to
have a uniform distribution in a region of their
representation space, this can be taken into account by
using these other formulas:

$$p_i = \frac{n_i d_{ii}}{\sum_j (n_j + 1)\, d_{jj} + \frac{2}{n-1} \sum_{j \neq k} d_{jk}}, \qquad p_0 = \frac{\sum_j d_{jj} + \frac{2}{n-1} \sum_{j \neq k} d_{jk}}{\sum_j (n_j + 1)\, d_{jj} + \frac{2}{n-1} \sum_{j \neq k} d_{jk}} \qquad (4)$$

In these new formulas, not only the size of the
groups is taken into account but also their density,
which is proportional to $n_i$. The probability $p_0$ is
also a function of the size of the groups: for the
same distances between the groups, the smaller the
groups, the bigger their density, and the smaller is
the probability of having a new image between
them.


The evaluation function is the entropy associated
with this set of probabilities, $S = -\sum_i p_i \ln p_i$, and the
process is stopped when this entropy is maximal.
This  maximizes  the  information  provided  to  the 
user  by  each  classification  request. 
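
The grouping procedure of Sections 3.1 and 3.2 can be sketched as
follows, under several assumptions that the text does not confirm. A
merged group's within-group term is the pair-count-weighted mean of
the old terms, and its distance to another group is the size-weighted
mean. The probabilities use the uniform-case formulas, with the
off-diagonal sum taken over all entries with j different from k.
Instead of stopping at the first entropy maximum, the sketch runs the
merges down to two groups and keeps the best partition seen. All
function names are illustrative.

```python
import math
import numpy as np

def merge_groups(D, sizes, i, j):
    """Merge groups i and j. D holds mean distances (diagonal: mean
    distance inside a group); sizes holds the image counts."""
    ni, nj = sizes[i], sizes[j]
    pairs = ni * (ni - 1) + nj * (nj - 1) + 2 * ni * nj
    diag = (ni * (ni - 1) * D[i, i] + nj * (nj - 1) * D[j, j]
            + 2 * ni * nj * D[i, j]) / pairs
    row = (ni * D[i] + nj * D[j]) / (ni + nj)  # distances to other groups
    keep = [k for k in range(len(sizes)) if k not in (i, j)]
    m = len(keep)
    new_D = np.empty((m + 1, m + 1))
    new_D[:m, :m] = D[np.ix_(keep, keep)]
    new_D[m, :m] = new_D[:m, m] = row[keep]
    new_D[m, m] = diag
    return new_D, [sizes[k] for k in keep] + [ni + nj]

def entropy(D):
    """Entropy of (p_1, ..., p_n, p_0) from the uniform-case formulas."""
    n = D.shape[0]
    if n < 2:
        return 0.0
    diag = np.diag(D)
    off = (D.sum() - diag.sum()) * 2.0 / (n - 1)
    probs = np.append(diag, off) / (diag.sum() + off)
    return -sum(p * math.log(p) for p in probs if p > 0)

def group_images(D):
    """Agglomerate from singletons (zero diagonal assumed); return the
    best entropy found and the group sizes at that point."""
    D = np.array(D, dtype=float)
    sizes = [1] * D.shape[0]
    best = (entropy(D), list(sizes))
    while len(sizes) > 2:
        masked = D + np.diag([np.inf] * len(sizes))  # ignore the diagonal
        i, j = np.unravel_index(np.argmin(masked), masked.shape)
        D, sizes = merge_groups(D, sizes, int(i), int(j))
        s = entropy(D)
        if s > best[0]:
            best = (s, list(sizes))
    return best
```

For four images at positions 0, 1, 10 and 11 on a line, with absolute
differences as distances, the sketch ends with two groups of two
images each. The density-aware variant of the probabilities is not
implemented here.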


Figure  2:  Three  of  the  groups  extracted  from  the 
training  sequence  of  Figure  1.  Only  three 
images  are  shown  for  each  group. 

3.3. Using Domain Constraints


An  important  constraint  in  the  context  of  landmark 
recognition  is  that  the  images  are  ordered  in  the 
training  sequence.  In  fact,  the  position  of  the  vehi¬ 
cle  is  recorded  for  each  image.  Using  this  informa¬ 
tion,  grouping  is  limited  to  images  for  which  the 
vehicle  positions  were  close  to  each  other,  thus 
ensuring  spatial  consistency. 

In  general,  given  an  image,  it  would  be  necessary 
to  consider  other  images  for  grouping  in  a  radius 
around  the  corresponding  vehicle  position.  In  prac¬ 
tice,  images  in  the  training  sequence  are  digitized 
at approximately equal intervals. As a result, it is
sufficient to consider for grouping with image i only
images j such that |i - j| < r, where r is the
maximum extent of observability of any given
landmark. This constraint reduces the computational
complexity  of  the  grouping  algorithm,  and  it  guar¬ 
antees  that  the  image  groups  correspond  to  spa¬ 
tially  coherent  landmarks. 
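
The effect of this constraint on the number of candidate pairs can be
sketched as below. The function name is illustrative, and r is
expressed as a number of images, which relies on the stated assumption
that images are digitized at roughly equal intervals.

```python
def candidate_pairs(num_images, r):
    """Return the pairs (i, j) with i < j and |i - j| < r, i.e. the
    only pairs of images considered for grouping."""
    return [(i, j)
            for i in range(num_images)
            for j in range(i + 1, min(i + r, num_images))]
```

For a 176-image sequence and a hypothetical window of r = 10, this
leaves 1539 candidate pairs instead of the 15400 unordered pairs of
the unconstrained problem.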

4.  Recognition 

Given a set of image groups $C_i$, characterized by
their mean vector $v_i$, their eigenvalues and
their eigenvectors $v_i^j$, the problem is to compare
these groups to a new image whose characteristic
vector is v. The eigenvalues are positive, and
the eigenvectors of a group are a family of
orthonormal vectors, and $v - v_i$ may be decomposed
with respect to this family: $v - v_i = \sum_j u_j v_i^j + r$. The
distance between v and the group $C_i$ is derived as:

$d(C_i, v)$ [right-hand side missing in the source]   (5)

This formula allows us to find which is the closest
group to an image. The problem is then to decide if
the image really belongs to this group. We again
use the task constraints. Intuitively, consecutive
images should be classified in a consistent manner.
Since we know the spatial extent from which a
particular group of images is visible in the training
sequence, we can eliminate the inconsistent
classification results as unreliable.


Figure 3: Comparison of two new images (top)
with the groups shown in Figure 2. The
distance graph of the left image is shown as a
solid line. The images are correctly classified
in group 7.


5. Examples

The recognition algorithm was tested using several
sequences taken from a moving vehicle. Figure 4
shows sample images from a test sequence taken
over the same course as the training sequence of
Figure 1. A total of 68 images were digitized from
the test sequence at regular intervals. The test set of
images was segmented manually into subsets
corresponding to the landmarks identified in the
training sequence. This provided the “ground truth” for
evaluating the performance of the algorithm.
Vehicle position was recorded using the INS system
on-board a HMMWV used in a separate Unmanned
Ground Vehicle (UGV) project [Hebert et al.,
1996].


Using  the  algorithm  outlined  above,  all  the  images 
in  the  test  sequence  were  compared  with  the  land¬ 
marks  found  in  the  training  sequence.  The  images 
corresponding  to  a  distinct  distance  minimum  are 
labeled  with  the  corresponding  landmark  number. 
Images  that  do  not  correspond  to  a  distance  mini¬ 
mum  for  any  landmark  are  left  unclassified.  As  we 
noted  earlier,  our  goal  is  not  to  label  every  image 
but  rather  to  correctly  recognize  the  ones  corre¬ 
sponding  to  the  most  salient  landmarks. 
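
One possible reading of this labeling step, combined with the
consistency requirement on consecutive images, is sketched below. The
rejection threshold, the rule that a label must be shared with a
neighboring image, and both function names are assumptions made for
illustration; the text does not specify how the constraint is
implemented.

```python
def label_images(distances, threshold):
    """distances[t][g] is the distance from test image t to group g.
    Return the nearest group per image, or None if it is too far."""
    labels = []
    for row in distances:
        g = min(range(len(row)), key=row.__getitem__)
        labels.append(g if row[g] < threshold else None)
    return labels

def enforce_sequence(labels):
    """Keep a label only if the previous or next image carries the
    same label; isolated labels are treated as unreliable."""
    out = []
    for t, lab in enumerate(labels):
        prev = labels[t - 1] if t > 0 else None
        nxt = labels[t + 1] if t + 1 < len(labels) else None
        out.append(lab if lab is not None and lab in (prev, nxt) else None)
    return out
```

Under this rule an isolated label that disagrees with its neighbors
becomes unclassified rather than misclassified.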

Figure  5  illustrates  the  classification  algorithm  for 
three  different  landmarks.  The  graphs  show  that, 
for  those  three  landmarks,  the  distance  minimum  is 
attained  at  the  correct  images  in  the  test  sequence. 
Figure  7  shows  a  view  of  all  the  landmarks  recog¬ 
nized  in  the  path  travelled  in  the  test  sequence. 

Figure 4: Sample images from a test sequence.
The  complete  sequence  is  68  images  over  a 
course  similar  to  the  one  used  in  Figure  1  for 
the  training  sequence.  The  orientation  of  the 
camera,  and  the  illumination  characteristics  are 
substantially  different  from  those  of  the 
training  sequence. 

Four  landmarks  are  recognized  in  this  example. 
Figure  7  also  illustrates  the  potential  use  of  land¬ 
mark  recognition  for  position  estimation. 

Two  types  of  errors  may  occur  during  recognition. 
First,  images  that  should  match  a  landmark  are 
matched  to  a  different  landmark.  We  call  those 
images  misclassified  images.  Second,  images  that 
should  match  a  given  landmark  are  not  matched 
with  any  landmark.  We  call  this  second  class  of 
images  unclassified  images.  To  reduce  the  error 
rate,  we  use  the  sequential  constraint  described  ear¬ 
lier.  This  constraint  is  quite  effective  in  practice. 
Figure  6  shows  the  error  statistics  for  the  recogni¬ 
tion  algorithm  with  and  without  sequential  con¬ 
straint. 

All  the  examples  given  so  far  involve  images  taken 
in  urban  environments.  We  have  also  conducted 
experiments  in  natural  environments  by  collecting 
training  sequences,  extracting  groups  of  distinctive 
images,  and  recognizing  them  in  test  sequences. 
Figure  8  shows  two  example  groups  computed 
from a typical training sequence. The error rate is
comparable  to  the  one  obtained  in  urban  scenarios. 

Figure 5: The classification algorithm illustrated
on two landmarks. Each graph shows the
distance between all the images of the test
sequence of Figure 4 and the groups found in the
training sequence (Figure 2). The graphs are
shown for landmarks 2 and 7. The graphs show
that the distance is minimum for the correct
landmark.


                         distance only   with sequential constraint
Urban Environment:
  correct                     72%                  93%
  misclassified               19%                   0%
  unclassified                 9%                   7%
Natural Environment:
  correct                     84%                  97%
  misclassified                6%                   0%
  unclassified                10%                   3%

Figure 6: Performance of the recognition
algorithm on the two example sequences.
Images are labeled as “misclassified” if they
are matched to the wrong group; they are
labeled as “unclassified” if they belong to a
group but are not matched.


Figure 7: Overhead view of the path followed
while collecting the images of the test
sequence of Figure 4 (distances are indicated
in meters). Four landmarks are correctly
identified. Example images from the test
sequence are shown for each landmark.


Figure 8: Images of one of the landmarks found
in a sequence of images in a natural
environment.


6. Conclusion

Results on image sequences in real environments
show that visual learning techniques can be used
for building image-based models suitable for
recognition of landmarks in complex scenes. The
approach performs well, even in the presence of
significant photometric and geometric variations,
provided that the appropriate domain constraints
are used. In the case of mobile robot navigation,
domain constraints include the sequential nature of
the images, and the discriminability of landmarks.

Our goal is to demonstrate the use of landmark
recognition for navigation. Specifically, we will show
that rough position estimation and navigation
based on the relative positions of landmarks can be
achieved using image-based landmark recognition.
Several limitations of the approach need to be
addressed. First of all, rejection of unreliable
groups needs to be improved. In particular, the
selection of the parameters controlling the
grouping needs to be implemented in a principled manner.
Second, images that do not contribute information
should be filtered out of the training sequences.

Abstract

A  wide  variety  of  machine  learning  mechanisms 
create  multiple  models  that  must  be  reconciled, 
chosen  among,  or  in  some  cases,  orchestrated.  In 
its  most  general  form,  this  orchestration  prob¬ 
lem  can  be  seen  as  part  of  the  multi-agent  learn¬ 
ing  problem.  This  paper  introduces  particu¬ 
lar  complexities  of  automatically  sub-dividing  a 
problem,  developing  multiple  solutions  to  each 
sub-problem,  and  then  orchestrating  the  sub¬ 
solutions  into  a  complete  solution.  This  pa¬ 
per  establishes,  through  a  series  of  experiments, 
that  this  divide  and  conquer  strategy  can  be 
done  automatically,  the  effectiveness  of  four  in¬ 
troduced,  specific  techniques  for  learning  or¬ 
chestration,  and  that  the  orchestration  of  sub¬ 
solutions  is  a  rich,  intriguing  area  for  machine 
learning  study. 

1  Introduction 

There  are  many  cases  in  which  a  task  to  be  ap¬ 
proached  with  machine  learning  techniques  can 
be or must be solved in more than one “piece.”
Learning  a  team  of  robotic  soccer  players  is 
a good example of a task that could conceivably
be done as a single agent, but lends itself
very naturally toward learning sub-solutions
and  then  (or  in  addition)  learning  to  ensure  the 
mutual  suitability  of  these  sub-solutions.  This 
insurance  of  mutual  suitability  is  the  orchestra¬ 
tion  problem. 

This paper will focus on PADO, an evolutionary
computation framework designed specifically
for signal classification (e.g., [Teller and
Veloso,  1997,  Teller  and  Veloso,  1995b,  Teller 
and  Veloso,  1995a]).  As  a  process  of  divide 
and  conquer,  PADO  evolves  multiple  pools  of 
sub-solutions  and  then  orchestrates  one  or  more 
learned  models  from  each  pool.  The  question 
we  investigate  in  this  work  is,  “What  opportu¬ 
nities  are  there  for  learning  in  the  orchestration 
process  and  how  much  improvement  can  this 
learning  provide?”  Our  experiments  demon¬ 
strate  that  orchestration  is  an  important  issue 
and  that  learned  orchestration  can  dramatically 
improve  generalization  performance. 

2  PADO 

PADO,  Parallel  Algorithm  Discovery  and 
Orchestration,  is  an  evolutionary  learning 
paradigm  specifically  designed  for  signal  clas¬ 
sification.  At  the  highest  level  of  description, 
PADO  has  two  main  learning  components:  al¬ 
gorithm  discovery,  and  orchestration.  PADO 
does  the  algorithm  discovery  through  a  process 
of  program  evolution  as  pictured  in  Figure  1. 

PADO  evolves  programs  in  a  PADO-specific 
graph  structured  language.  At  the  beginning 
of  a  learning  session,  the  main  population  is 
filled with P programs that have been randomly
generated using a grammar for the legal syntax
of the language. All programs in this language
are constrained by the syntax to return a
number that is interpreted as a confidence value.
The exact structure of the language, Neural
Programming, and the associated recombination
process is detailed in [Teller and Veloso,
1996]. For the purposes of this paper it
suffices to understand that the PADO architecture
has the ability to learn a number of programs
that do well at solving part or all of a particular
problem.


Figure 1: The evolution of programs

The goal of the PADO architecture is to learn
to take signals as input and output correct class
labels (i.e. perform the signal-to-symbol
mapping). When there are C classes to choose from,
PADO starts by learning C different “pools” of
discrimination programs. System_I is
responsible for taking a signal as input and returning
a confidence that class I is the correct label.
System_I is built out of one or more programs
for pool I, learned by PADO. Each of these
programs does exactly what the system as a
whole does; it takes a signal as input and
returns a confidence that label I is the correct
label (rather than returning a value between 1
and C). The reason for this seeming
redundancy can be found in [Teller and Veloso, 1997].
PADO performs signal classification by
orchestrating the responses of the C systems.

signal class I and “low confidence” when
presented with a signal from any other class.

A simple orchestration scheme that was used in
reported PADO work (e.g., [Teller and ...]) is
depicted in Figure 2. System_I is built out of the
S programs that best (based on training results)
learned to recognize class I. The S responses that
the S programs return on seeing a particular signal
are combined into a weighted average. This
average is interpreted as the confidence of System_I
that the signal in question is from class I. The
responses of the C pools are weighted and
combined in a similar way.


Figure 2: PADO’s old orchestration strategy

In this paper we will concentrate on the higher
level orchestration in PADO. We will make the
simplifying assumption that exactly one
program will be chosen from each discrimination
pool and orchestrated in one of the ways
outlined in the next section. This simplification of
PADO’s orchestration for explanation purposes
is shown in Figure 3. The following section will
address issues of which programs to use, how to
use them, and how to bias them through
evolution itself.

The  basic  justification  for  subdividing  work  as 
PADO  does  is:  it  is  usually  preferable  to  search 
C  spaces  of  size  2^  rather  than  one  space  of 
size  2*^^  (A  >  1).  In  classification  problems  it 
is  possible  to  automatically  divide  up  a  prob¬ 
lem:  one  classification  problem  of  C  classes  can 
always  be  decomposed  into  C  discrimination  Figure  3:  Part  of  PADO’s  new  orchestration 
problems.  A  program  in  System/  learns  to  say  strategy 

“high  confidence”  when  shown  an  instance  of 
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3  Orchestration  Options 

The  orchestration  options  that  we  investigate  in 
the  course  of  this  work  include:  1)  default  or¬ 
chestration,  2)  evolved  orchestration,  3)  learned 
weight  orchestration,  4)  learned  program  or¬ 
chestration,  and  5)  combination  strategies. 

3.1  Default  Orchestration 

In  the  simplest  option,  default  orchestration, 
PADO  picks  the  best  program  from  each  dis¬ 
crimination  pool  and  orchestrates  them  using 
a  fixed  function  with  fixed  parameters  (i.e.  no 
learning).  This  orchestration  will  be  the  base¬ 
line  against  which  we  will  measure  other  tech¬ 
niques.  The  first  step  in  default  orchestration  is 
to  pick  the  best  program  from  each  discrimina¬ 
tion  pool  to  be  in  the  orchestrated  set.  Specif¬ 
ically,  the  procedure  is  to  sort  the  programs  in 
each  discrimination  pool  according  to  training 
set  fitness  and  choose  the  top  ranked  program 
from  each  discrimination  pool.  This  program 
set  is  orchestrated  with  a  simple  function  using 
reasonable,  fixed  coefficients,  hereafter  referred 
to  as  weights. 

Each program X has a fitness, Fx (averaged
over all training examples), that ranges between
0.0 and 1.0. Let us give each program X selected
as part of the final PADO classifier an
orchestration weight of Wx = Fx. For a particular
test signal, each program is shown the
signal and returns a response Rx between Rmin
and Rmax. Let us refine Rx according to:

Rx := (Wx * Rx) - (1 - Wx) * [...]   (1)

Now  if  the  program  from  class  i  has  the  high¬ 
est  Rx,  PADO  concludes  the  test  signal  is  from 
class  i.  However,  because  of  equation  1,  PADO 
“listens  more  attentively”  to  programs  that  are, 
on  average,  “more  reliable”  (i.e.  have  higher 
fitness).  Notice  that  this  default  orchestration 
makes  no  attempt  to  pick:  the  best  weights  for 
each  program,  the  best  programs  to  orchestrate, 
or  a  fitness  function  that  promotes  orchestra¬ 
tion. 
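
The default scheme can be sketched in a few lines of Python. This is a minimal illustration rather than PADO's implementation: programs are modelled as plain callables, the names `pick_best`, `refine`, and `classify` are hypothetical, and because the final factor of equation 1 is not recoverable from the text it is passed in as a caller-supplied constant `offset` (an assumption).

```python
from typing import Callable, List, Sequence, Tuple

Signal = Sequence[float]
Program = Callable[[Signal], float]      # returns a confidence response R_X
Member = Tuple[Program, float]           # (program, training-set fitness F_X)


def pick_best(pool: List[Member]) -> Member:
    """Top-ranked program of one discrimination pool, by training fitness."""
    return max(pool, key=lambda member: member[1])


def refine(response: float, weight: float, offset: float) -> float:
    """Equation 1, with `offset` standing in for its final factor (assumed)."""
    return weight * response - (1.0 - weight) * offset


def classify(signal: Signal, pools: List[List[Member]], offset: float) -> int:
    """Index of the class whose chosen program has the highest refined response."""
    chosen = [pick_best(pool) for pool in pools]
    # The orchestration weight of each chosen program is simply its fitness.
    refined = [refine(prog(signal), fit, offset) for prog, fit in chosen]
    return max(range(len(refined)), key=refined.__getitem__)
```

Nothing here is searched or tuned: the program choice and the weights both come directly from training-set fitness, which is what makes this the baseline.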


3.2  Evolved  Orchestration 

An  alternative  to  learning  how  best  to  orches¬ 
trate  a  number  of  programs,  or  which  programs 
to  orchestrate,  is  to  try  to  change  the  basic 
learning  of  the  programs  so  that  the  programs 
that  perform  best  on  the  training  set  will  also  be 
the  best  orchestrated  exactly  as  they  are.  Be¬ 
cause  PADO  learns  programs  in  an  evolutionary 
framework,  this  amounts  to  incorporating  the 
demands  of  a  particular  orchestration  strategy 
into  the  fitness  function.  At  each  generation,  af¬ 
ter  computing  the  fitness  of  each  evolving  pro¬ 
gram,  PADO  can  assign  each  program  a  new 
fitness,  based  on  its  ability  to  orchestrate  with 
random  individuals  from  the  other  discrimina¬ 
tion  groups. 

For each discrimination pool i, PADO creates
K biased-random* groups (using one program
from each of the other C - 1 discrimination
pools) called P1..PK. For each discrimination
group i, for each program X in the group:

$F'_X = \frac{1}{K} \sum_{j=1}^{K} \mathrm{Eval}(X \cup P_j, \mathit{Weights})$   (2)

Eval(X ∪ Pj, Weights) is the percentage of training
examples the orchestrated PADO program
set {X ∪ Pj} correctly classifies, orchestrated
with Weights (just the Fx’s in this case). This
value is an approximation to how well X “orchestrates
in general”, relative to other programs
in the same discrimination pool.

From this point on, PADO follows the default
orchestration strategy except that the program
from each discrimination pool chosen for orchestration
will be the best based on F'x rather
than old fitness. In evolved orchestration, F'x
replaces Fx in the entire process.
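
A sketch of this re-scoring step follows, reading equation 2 as the average of Eval over the K groups (an interpretation of the text). The helper names, the fitness-proportional form of the "biased-random" pick, and the use of fitness-weighted responses alone inside `evaluate` are assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Signal = Sequence[float]
Program = Callable[[Signal], float]
Member = Tuple[Program, float]                 # (program, fitness F_X)
Example = Tuple[Signal, int]                   # (signal, class index)


def evaluate(team: List[Member], data: List[Example]) -> float:
    """Fraction of examples the team labels correctly; team[c] speaks for class c.

    Responses are weighted by fitness only (W_X * R_X), a simplification.
    """
    correct = 0
    for signal, label in data:
        scores = [fit * prog(signal) for prog, fit in team]
        if max(range(len(scores)), key=scores.__getitem__) == label:
            correct += 1
    return correct / len(data)


def biased_pick(pool: List[Member], rng: random.Random) -> Member:
    """Random pick favouring fitter programs (fitness-proportional; assumed)."""
    return rng.choices(pool, weights=[fit for _, fit in pool], k=1)[0]


def make_groups(pools, i: int, k: int, rng) -> List[List[Optional[Member]]]:
    """K groups, each with one program from every pool except pool i."""
    return [
        [None if c == i else biased_pick(pools[c], rng) for c in range(len(pools))]
        for _ in range(k)
    ]


def orchestration_fitness(x: Member, i: int, groups, data) -> float:
    """Average accuracy of X when orchestrated with each of the K groups."""
    total = 0.0
    for group in groups:
        team = list(group)
        team[i] = x            # X fills the slot left open for pool i
        total += evaluate(team, data)
    return total / len(groups)
```

The same K groups are reused for every program in pool i, so the resulting scores are comparable within that pool.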

3.3  Learned  Weight  Orchestration 

In  learned  weight  orchestration  PADO  tries  to 
find  the  best  set  of  weights  for  the  chosen  set  of 
programs to be orchestrated. We get one degree
of freedom for each program to be orchestrated
by allowing each Wx to vary between 0 and 1
(instead of simply setting it to Fx).

* Picked randomly, biased for higher fitness individuals.

To begin, this strategy will set the Progs to orchestrate
to be the best program from each discrimination
pool based on training set fitness.
Next, the strategy will initialize the Weights
and BestWeights for orchestration to Wx = Fx
for W1..WC just as in the default orchestration
case. Now we can search for better values of
W1..WC. Now repeat S times:

1. Pick i between 1 and C.

2. Change Weights[i].

3. If Eval(Progs, Weights) > Eval(Progs, BestWeights),
   then update BestWeights[i].

Do step 1 based on previous successes changing
that element of the vector. Change step 2 in
the direction that most recently helped when
changing this vector element. Anneal step 3 so
that local minima are partly avoided.
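
The three steps can be sketched as a simple greedy search. This is a simplified illustration, not the procedure used in the experiments: step 1 picks uniformly, step 2 applies a random perturbation, and step 3 accepts only strict improvements, so the success-based biases and the annealing described above are left out. The names `evaluate`, `learn_weights`, and `step_size` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Signal = Sequence[float]
Program = Callable[[Signal], float]
Example = Tuple[Signal, int]             # (signal, class index)


def evaluate(progs: List[Program], weights: List[float], data: List[Example]) -> float:
    """Fraction of examples labelled correctly when responses are scaled by weights."""
    correct = 0
    for signal, label in data:
        scores = [w * p(signal) for p, w in zip(progs, weights)]
        if max(range(len(scores)), key=scores.__getitem__) == label:
            correct += 1
    return correct / len(data)


def learn_weights(progs, fitnesses, data, steps, rng, step_size=0.1):
    """Greedy search over weights in [0, 1], starting from W_X = F_X."""
    best_weights = list(fitnesses)
    best_score = evaluate(progs, best_weights, data)
    for _ in range(steps):
        weights = list(best_weights)
        i = rng.randrange(len(weights))                    # step 1: uniform pick
        delta = rng.uniform(-step_size, step_size)         # step 2: random change
        weights[i] = min(1.0, max(0.0, weights[i] + delta))
        score = evaluate(progs, weights, data)
        if score > best_score:                             # step 3: keep improvements
            best_weights, best_score = weights, score
    return best_weights, best_score
```

Because only strict improvements are kept, this version can stall on plateaus where no single small change alters any classification, which is one reason an annealed acceptance rule is useful.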

3.4  Learned  Program  Orchestration 

In learned program orchestration weights are
fixed at their default values (Wx = Fx) and
PADO tries to find the set of programs that
best fit those weights. We get one degree of
freedom per class (per orchestration weight) by
allowing the program to orchestrate for discrimination
class i to vary over all of the programs
in discrimination pool i.

To begin, this strategy will initialize the Progs
and BestProgs to be the best program from each
discrimination pool based on standard Fx. The
orchestration weights (Weights) will be set for
each program X to be orchestrated to its Fx.
Now we can search for better programs X1..XC.
Now repeat S times:

1. Pick i between 1 and C.

2. Exchange Progs[i] for another program from
   the discrimination class i pool.

3. If Eval(Progs, Weights) > Eval(BestProgs, Weights),
   then update BestProgs[i].

Do step 1 based on previous successes changing
that element of the vector. Pick step 2 with a
bias toward more highly fit programs. Anneal
step 3 so that local minima are partly avoided.
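
The program-space search has the same shape as the weight-space search. The sketch below is illustrative only: step 1 is uniform, step 2 uses a fitness-proportional pick as one possible bias toward fitter programs, and step 3 is greedy rather than annealed. The helper names are hypothetical, and `evaluate` scores a team using fitness-weighted responses, which is a simplification.

```python
import random
from typing import Callable, List, Sequence, Tuple

Signal = Sequence[float]
Program = Callable[[Signal], float]
Member = Tuple[Program, float]           # (program, fitness F_X, also its weight)
Example = Tuple[Signal, int]             # (signal, class index)


def evaluate(team: List[Member], data: List[Example]) -> float:
    """Fraction of examples the team labels correctly; team[c] speaks for class c."""
    correct = 0
    for signal, label in data:
        scores = [fit * prog(signal) for prog, fit in team]
        if max(range(len(scores)), key=scores.__getitem__) == label:
            correct += 1
    return correct / len(data)


def learn_programs(pools: List[List[Member]], data: List[Example], steps: int, rng):
    """Greedy search for the program set that works best under weights W_X = F_X."""
    best = [max(pool, key=lambda member: member[1]) for pool in pools]
    best_score = evaluate(best, data)
    for _ in range(steps):
        progs = list(best)
        i = rng.randrange(len(pools))                          # step 1
        pool = pools[i]
        # step 2: swap in another program, favouring fitter ones
        progs[i] = rng.choices(pool, weights=[fit for _, fit in pool], k=1)[0]
        score = evaluate(progs, data)
        if score > best_score:                                 # step 3
            best, best_score = progs, score
    return best, best_score
```

Each swapped-in program brings its own fitness as its weight, so the weights stay at their default values throughout the search.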

3.5  Example  Combination  Strategies 

It  is  possible  to  combine  variations  like  the  ones 
described  above  in  a  variety  of  ways.  For  exam¬ 
ple,  we  can  first  search  through  program  space 
with  a  set  of  fixed  weights  (section  3.4),  and 
then  once  having  found  this  cooperative  set  of 
programs  we  can  then  search  through  weight 
space  (section  3.3)  for  this  discovered  set  of 
programs  for  the  optimal  weight  set  for  those 
programs.  This  combination  strategy  will  be 
referred  to  in  the  following  experiments  sec¬ 
tion  as  simply  “Learned  Prog..Weight  Orches¬ 
tration.”  Another  combination  possibility  is 
to  evolve  better  orchestration  (section  3.2)  and 
then,  at  orchestration  time,  to  do  a  search  to 
find  the  best  program  set  (section  3.4). 

4  Experiments 

All  of  the  following  experiments  were  performed 
with  the  following  fixed  parameters:  popula¬ 
tion  size  600,  crossover  48%,  mutation  48%, 
12000  program  combinations  considered  when 
learning  program  orchestration,  12000  weight 
combinations  considered  when  learning  weight 
orchestration,  and  no  drift  between  the  sub¬ 
populations.  Every  curve  in  every  figure  in  this 
section  is  an  average  over  at  least  50  separate, 
independent runs. Programs were exposed to
up to 200 randomly generated training signals**.
The test set consisted of a different set of 500
randomly generated signals.

** See [Teller and Andre, 1997] for details.

4.1 Four Classes

In this problem, a simple wave form type was
used as the input. The locations of the two
spikes in the otherwise highly damped wave
move about evenly over the signal. The signal
has four possible classes that are encoded as a
two bit binary number in the two wave spikes:
low amplitude means “0”, high variance means
“1.” Figure 4 shows one randomly selected example
from each of these four classes.


Figure  4:  One  example  from  each  of  the  four 
classes. 

Figure  5  has  a  number  of  different  curves.  This 
is  a  graph  measuring  the  computational  effort 
required  to  achieve  a  given  level  of  task  per¬ 
formance.  For  example,  the  “default  orchestra¬ 
tion”  curve  has  a  point  at  (0.7,15).  This  is  to 
be  read  as  “In  order  to  achieve  a  70%  gener¬ 
alization  test  set  performance  using  the  default 
orchestration  strategy,  PADO  must  run  for,  on 
average,  15  generations.” 
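
One way to read such a point off raw run data is sketched below. The paper does not say how the averaging was done, so this is an assumption: each run is taken to be a list of test-set accuracies per generation, effort is the first generation that reaches the target, and runs that never reach it are left out of the mean. The function names are hypothetical.

```python
from typing import List, Optional, Sequence


def generations_to_reach(curve: Sequence[float], level: float) -> Optional[int]:
    """First generation (1-based) whose accuracy is at least `level`, else None."""
    for generation, accuracy in enumerate(curve, start=1):
        if accuracy >= level:
            return generation
    return None


def mean_effort(curves: List[Sequence[float]], level: float) -> Optional[float]:
    """Mean effort over the runs that reach `level`; None if no run does."""
    reached = [generations_to_reach(curve, level) for curve in curves]
    hits = [g for g in reached if g is not None]
    return sum(hits) / len(hits) if hits else None
```

Under this reading, a lower curve means a strategy needs fewer generations to reach the same generalization level.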

The  “default  orchestration”  curve  is  the  one 
against  which  the  other  curves  should  be  mea¬ 
sured  and  shows  PADO’s  performance  when 
no  orchestration  search  is  done.  Figure  5  also 
shows  four  learning  strategies:  learned  program 
orchestration,  learned  weight  orchestration, 
evolved orchestration, and evolved..learned program
orchestration (do evolved orchestration
and then do learned program orchestration).


Figure 5: Four Class Results. (Axis label:
Computational Effort in Generations.)

The most important feature of Figure 5 is
that the evolved orchestration actually does
worse than the default orchestration strategy
(i.e. no learning). While PADO does better
in the “evolved..learned program orchestration”
paradigm than in the “evolved orchestration”
paradigm, this is still worse than simple
default orchestration.

Though  this  paper  will  not  try  to  substantiate 
this  claim,  here  is  an  intuition  for  why  putting 
the  orchestration  directly  into  the  fitness  func¬ 
tion  actually  hurt  PADO’s  performance.  In 
the  default  orchestration  case  there  was  some 
orchestration  going  on,  just  no  orchestration 
learning.  In  addition,  PADO  begins  by  sub¬ 
dividing  its  population,  and  thereby  the  prob¬ 
lem,  into  several  easier  sub-problems.  This  sub¬ 
problem  division  is  reasonable  for  classification 
and  PADO  enforces  it  throughout  the  entire 
run.  When  fitness  gets  tied  to  orchestration 
instead  of  discrimination,  PADO  loses  exactly 
those  distinctions.  So  we  argue  that  PADO  may 
have  gained  something  through  evolved  orches¬ 
tration,  but  at  the  cost  of  losing  the  whole  mech¬ 
anism  of  divide  and  conquer  that  made  orches¬ 
tration  important  in  the  first  place. 

4.2  Eight  Classes 

Our  next  experiment  will  take  a  second  look 
at  the  “Learned  Program  Orchestration”  and 
“Learned  Weight  Orchestration”  strategies.  In 
this  second  experiment  we  preserve  all  the  de¬ 
tails  of  the  previous  experiment  except  that 
the  problem  has  more  classes  and  is  a  harder 
problem  of  the  same  type.  The  natural  exten¬ 
sion  of  the  previous  experiment  along  these  con¬ 
straints  is  to  use  an  eight  class  signal  classifica¬ 
tion  problem  with  similar  damped  waves  and 
spikes.  There  are  now  three  spikes  in  each  wave 
form,  still  in  variable  locations,  and  the  three 
spike  amplitudes  encode  the  class  (low  ampli¬ 
tude  is  “0”  and  high  amplitude  is  “1”).  Fig¬ 
ure  6  shows  one  randomly  selected  signal  from 
each  class. 

Figure 7 shows the average performance results
of four different strategies. Again, the
benchmark performance is the curve labeled
“Default Orchestration.” The new strategy that
this section introduces is labeled “Learned
Prog..Weight Orchestration” and is simply a
search through program space followed by a
search through weight space.

Figure 7 has two main points of interest. The
first to notice is that the problem is indeed
harder; generation 30 produces a PADO system
with lower generalization. The second point of
interest is that, again, all three of the orchestration
learning procedures improve considerably
over default orchestration.

Figure 6: One example from each of the eight
classes.

Figure 7: Eight Class Results.

4.3  Ten  Classes 

As  a  further  experiment  let  us  now  try  a  dif¬ 
ferent  kind  of  signal  and  increase  the  number 
of  classes  in  the  domain  again.  For  this  exper¬ 
iment  we  will  test  the  same  four  orchestration 
strategies  on  a  signal  that  expresses  its  class  by 
the  slope  of  its  wave  form.  Since  this  is  not 
directly  observable  by  the  PADO  constituent 
programs  and  so  must  be  inferred  from  multi¬ 
ple  local  observations,  this  is  still  not  a  trivial 
problem.  Figure  8  shows  one  randomly  selected 
example  from  each  class. 

Once again, we can see from the curves in Figure 9
that all three orchestration learning strategies
do better than orchestration without learning.
In this experiment the search in weight
space was less effective than the search in program
space, though still helpful. By the time
we reach a PADO system generalization level
of 60%, the default orchestration curve has a
much steeper slope than the “Learned Program
Orchestration” and “Learned Prog..Weight Orchestration”
curves.

Figure 8: One example from each of the ten
classes.

Figure 9: Ten Class Results.

In the last two experiments discussed in this paper,
the “Learned Prog..Weight Orchestration”
curve is within statistical noise of the “Learned
Program Orchestration” curve for every level of
generalization. The explanation is that, in the
“Learned Prog..Weight Orchestration” strategy,
the best program set is found for a particular
fixed weight vector and the weights are then
tuned for that program set. It is therefore not
surprising that, for the program set selected, the
best weight vector is almost always the one under
which it was optimized.

5  Related  Work 

Examples of the evolution of behavior coordination
can be found in [Haynes et al., 1995,
Haynes and Sen, 1996]. In these examples, the
“teams” are explicitly grouped, leading to a natural
incorporation of cooperation into the fitness
function. [Wolpert, 1992] gives a very
thorough theoretical account of “stacked generalization.”
The general conceit of stacked generalization
is that instead of having a learning
algorithm entirely solve a problem, one or more
models can be used to partially solve the problem.
Then, the outputs of those models can be
“stacked” as inputs to a new learner. Though
the  description  is  very  different,  the  orchestra¬ 
tion  problem  can  be  seen  as  a  specific  difficulty 
in  stacked  generalization.  This  work  has  at¬ 
tempted  to  address  some  of  these  specific  dif¬ 
ficulties. 

6  Conclusions 

We  addressed  the  orchestration  problem  in  the 
context  of  PADO,  a  divide  and  conquer  evolu¬ 
tionary  technique  for  signal  classification.  In 
general, the issues studied apply to divide-and-conquer
learning problems in which putting the
sub-solutions  together  again  (i.e.,  orchestrating 
them)  is  non-trivial. 

Three  experiments  on  distinct  signals  demon¬ 
strated  the  feasibility  of  PADO’s  divide  and 
conquer  strategy.  The  failure  of  the  evolved  or¬ 
chestration  procedure  suggested  PADO’s  prefer¬ 
ability  to  unconstrained  learning.  These  experi¬ 
ments  provided  a  specific  justification  for  main¬ 
taining  a  population:  orchestration  puts  the  op¬ 
tions  a  population  provides  to  good  use.  This 
work  introduced  four  specific  techniques  for  or¬ 
chestration  learning  which  demonstrated  that 
orchestration  is  an  important  issue  and  that 
learned  orchestration  can  provide  dramatic  gen¬ 
eralization  improvements. 
Abstract¹

We  have  assembled  a  multi-spectral  imaging  sys¬ 
tem  operating  in  the  visible  to  near  IR  range  utiliz¬ 
ing  an  existing  acousto-optic  tunable  filter  (AOTF). 
This  configuration  has  been  characterized,  yielding 
design  optimization  information.  Critical  data 
include  spatial  and  spectral  resolution,  out-of-band 
rejection,  efficiency,  field  of  view,  and  bandwidth. 
The  design  goal  is  efficient  operation  over  nearly 
two  octaves  of  wavelength,  and  superior  image 
quality.  Two  major  issues  were  successfully 
addressed:  1)  the  method  of  applying  the  multiple 
electrical  RF  control  signals  to  the  AOTF  trans¬ 
ducer  to  fully  exploit  the  multispectral  capability, 
and 2) object identification using the color signature
signal delivered by the AOTF.

1.  Objective 

This  program  task  incorporates  the  spectral  (color) 
dimension  into  the  object  recognition  process.  A 
programmable  optical  filter  is  utilized  at  the  sys¬ 
tem’s  front  end  to  reduce  the  computational  load 
and  its  resulting  bottlenecks  in  future  automated 
vision  systems.  By  filtering  the  incoming  scene 
according  to  its  spectral  composition,  a  large 
amount  of  undesirable  background  clutter  can  be 
removed  prior  to  digital  processing.  Figure  1  is  a 
schematic representation of the process. Enhanced
performance is anticipated in a variety of applications,
including human sensory augmentation systems
for driver assistance. This recognition process
will more closely copy the human observer in its
ability to extract and track objects.

Figure 1: Object recognition using color
discrimination.

1. This research has been sponsored in part by the Office of Naval
Research (ONR) under Contract N00014-95-1-0591. The
views and conclusions contained in this document are those of
the authors and should not be interpreted as representing the
official policies, either expressed or implied, of ONR or the
U.S. Government.


2.  Approach 

Multi-spectral imaging is based on a “smart filter”
concept that utilizes features available with the
acousto-optic tunable filter (AOTF). Such smart filters
provide dynamically programmable spectral
image selection, which in combination with new
computational sensors, can result in new strategies
for object recognition and tracking. The key features
of the AOTF that make it so effective
for this purpose include all-solid state construction
and electronic operation, rapid random access of
wavelength, simultaneous multiple-wavelength
operation, and electronically adjustable pass bandwidth;
these features are readily adaptable to computer
control. In addition, the devices are compact
and robust, and easily integrated into the optical
package. While combinations of many of these features
can be found in other types of devices, the
AOTF is unique in its simultaneous multiple wavelength
capability. As such it can remove latency
and simplify the development of recognition algorithms.
The key components needed to incorporate
optical preprocessing with a single IR camera are
shown in Figure 2. Figure 3 shows a photograph of
the hardware under assembly for use on this MURI
project. The input or first camera lens images the
scene onto the first imaging plane which contains a
field-of-view defining iris. This iris is adjusted to
provide complete spatial separation of the filtered
and the spectrally notched images and, if necessary,
can be adjusted to suppress unwanted
“flare.” The middle lens (also a standard camera-type
lens) “collimates” the light that passes
through the AOTF. A phase retarder may be added
if polarimetric analysis is warranted. The third
camera lens brings the filtered light from the first
imaging plane onto the imaging plane of the camera.
Our experience confirms that this optical
arrangement is superior to alternative geometries
that have been reported.

Figure 2: Key components of the multispectral
imager. (Diagram labels: object, polarized radiation,
defining optics, AOTF, stress optic retarder,
software programmable controller.)

Figure 3: Smart filter camera system that will be
used for data collection.

Initial experiments on multispectral imaging of a
color scene containing “friend” and “foe” vehicles
have provided very encouraging results as shown
in Figure 4. The “signature” and “non-signature”
images were taken through the AOTF tuned to the red
(600-630 nm) and green (500-530 nm) regions of
the visible spectrum, respectively. We used Visual
Basic® code to implement pixel-by-pixel
processing of the raw images. As a result of
the processing, considerable reduction of the white
background was achieved with a simultaneous
strong increase in contrast of the “friend” vehicle icon
image with respect to the “foe” vehicle icon and the
background (multiplication). Finally, intensity
thresholding of the scene yields an image of the
vehicle icon of interest only and suppresses all the
other features (threshold subtraction).

Figure 4: Multispectral filtering can simplify the
processing in object recognition scenarios and
enhances target identification.

3. Factors Affecting AOTF Image Quality

One of the critical issues we have addressed is
the characterization and quantification of
the various factors that impact the overall image
quality. The AOTF, because of its diffractive
nature, degrades image quality unless adequate
compensating measures are taken. The advantages
of the AOTF in spectral imaging systems are
largely the same as for non-imaging systems, and
have been well established by recent work. The
published literature contains many examples of
AOTFs being successfully incorporated in imaging
systems to be used for a variety of applications,
but it also points to several critical issues that must be
addressed. These issues are largely driven by the
particular requirements for high quality imaging,
and often are not relevant for non-imaging applications.
It is generally necessary to pay close attention
to optimizing the design of the AOTF to assure
good imaging quality, and to recognize the limitations
that may be imposed by the fundamental
physical effects.

A  variety  of  physical  factors  must  be  considered  in 
high  image  quality  AOTF  spectral  imaging  sys¬ 
tems.  The  factors  are  manifested  both  in  the  basic 
design  and  in  the  implementation  of  AOTF-based 
imaging  systems.  The  spatial  resolution,  back¬ 
ground  level  and  other  quality  criteria  of  the  AOTF 
spectral  image  depend  critically  on  the  AOTF 
design  details.  We  have  analyzed  the  factors  as 
they  relate  to  spectral  resolution,  spatial  resolution 
and  AO  interaction  dispersion.  Fabrication  issues 
affect  suppression  of  background  through  acoustic 
scattering,  acoustic  field  homogeneity  and  optical 
scattering.  In  this  paper  we  will  review  these 
effects  and  describe  some  measurements  and  anal¬ 
ysis  that  we  carried  out. 

3.1.  Image  Blur 

One  of  the  major  advantages  of  the  AOTF  is  the 
relatively  large  angular  aperture  which  enhances 
the  throughput  of  the  device.  The  large  acceptance 
aperture  results  from  the  birefringent  character  of 
the  diffraction  process,  so  that  there  will  be  a 
dependence  of  diffracted  image  angle  on  wave¬ 
length;  there  will  therefore  be  a  blur,  or  angular 
spread,  in  the  image  due  to  the  finite  spectral  band¬ 
pass  of  the  AOTF.  This  angular  spread  can  seri¬ 
ously  degrade  the  image  quality  unless  steps  are 
taken  in  the  design  to  minimize  it.  It  may  even  be 
possible  to  produce  practical  systems  in  which  the 
image  resolution  is  no  worse  than  imposed  by  aper¬ 
ture  diffraction. 

Closely related prior work on AOTF imaging blur,
in which one of our group had been involved, was
recently reported [Suhre et al., 1992]. These investigations
were done in the 8 to 12 micrometer
infrared range utilizing a TAS device. Those results
were greatly extended under the present MURI
program. Calculations were made for $\Delta\theta_d$, the
internal angular spread of the diffracted light, as a
function of $\Delta\lambda$, the bandpass of the AOTF. The
principal design parameters for a noncollinear
AOTF are shown in Figure 5. The increase in $\Delta\theta_d$
with bandpass is expected, since for a fixed value
of the angle of incidence, $\theta_i$, the acoustic beam
spread must increase to accommodate increasing
values of bandpass. This must be done by decreasing
the acoustic interaction length, the only
remaining design parameter. The increasing values
of $\Delta\theta_d$ with increasing $\theta_i$ can be understood
through some simplifying approximations in the
analysis. For small values of $(n_i - n_d)$ and
$(\theta_i - \theta_d)$, the non-critical phase matching (NPM)
condition can be approximated as

$\lambda/\Lambda = n_o(\theta_i - \theta_d)$   (1)

where $n_o$ is the ordinary index of refraction.


Figure  5:  Principal  design  parameters  of  an  AOTF. 


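The usual NPM approximation for the acoustic
tuning wavelength, $\Lambda$, is

$\frac{\lambda}{\Lambda} = \Delta n\,(\sin^4\theta_i + \sin^2 2\theta_i)^{1/2}$   (2)

so that an approximation to the beamspread is

$\Delta\theta_d = \frac{\Delta n}{n_o}\,\frac{\Delta\lambda}{\lambda}\,(\sin^4\theta_i + \sin^2 2\theta_i)^{1/2}$   (3)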

This approximation agrees well with the exact
calculations [Suhre et al., 1992]. While this simplified
formulation shows the explicit dependence of
the blur on the AOTF bandpass, it is straightforward
to recast the dependence on the transducer
length by substituting for $\Delta\lambda$ its dependence on
transducer length $L$,

$\Delta\lambda = \frac{1.8\pi\lambda^2}{\Delta n\,L\,\sin^2\theta_i}$   (4)

to obtain

$\Delta\theta_d = \frac{1.8\pi\lambda}{n_o L}\,\frac{(\sin^4\theta_i + \sin^2 2\theta_i)^{1/2}}{\sin^2\theta_i}$   (5)
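
The step from the phase-matching relations to the blur expressions can be made explicit. The lines below are a sketch under the small-angle assumptions already stated, reading equations (1), (2), and (4) as $\lambda/\Lambda = n_o(\theta_i - \theta_d)$, $\lambda/\Lambda = \Delta n(\sin^4\theta_i + \sin^2 2\theta_i)^{1/2}$, and $\Delta\lambda = 1.8\pi\lambda^2/(\Delta n\,L\sin^2\theta_i)$. Treating $\Lambda$ and $\theta_i$ as fixed while $\lambda$ varies over the passband is an assumption of this sketch.

```latex
\begin{align*}
\theta_d &= \theta_i - \frac{\lambda}{n_o \Lambda}
  && \text{from (1)} \\
|\Delta\theta_d| &\approx \frac{\Delta\lambda}{n_o \Lambda}
  && \text{vary } \lambda \text{ at fixed } \Lambda,\ \theta_i \\
 &= \frac{\Delta n}{n_o}\,\frac{\Delta\lambda}{\lambda}\,
    \left(\sin^4\theta_i + \sin^2 2\theta_i\right)^{1/2}
  && \text{using (2) for } 1/\Lambda \\
 &= \frac{1.8\pi\lambda}{n_o L}\,
    \frac{\left(\sin^4\theta_i + \sin^2 2\theta_i\right)^{1/2}}{\sin^2\theta_i}
  && \text{using (4) for } \Delta\lambda
\end{align*}
```

The last line scales as $1/L$, which is why longer interaction lengths reduce the blur.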


It is clear that near diffraction limited performance
for practical designs will only be achieved for
fairly long interaction lengths L yielding high
spectral resolution, since we must satisfy the condition
that $\Delta\theta_d < \lambda/D$. For a one centimeter aperture
and a wavelength of 10 micrometers, this will
be a beamspread of about 1 milliradian; (5) suggests
that values of $\theta_i$ must, in general, be kept
small for good spatial resolution. The experimental
measurements reported in reference 1 carried out
in the wavelength range from 7 to 11 micrometers
were in good agreement with this analysis.
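
The diffraction criterion is easy to check numerically. The sketch below computes $\lambda/D$ and an internal blur angle from equation (5) as it is read here, namely $1.8\pi\lambda(\sin^4\theta_i + \sin^2 2\theta_i)^{1/2}/(n_o L\sin^2\theta_i)$. The function names are hypothetical, any index value passed as `n_o` is a placeholder chosen by the caller, and the simplified formula is not expected to reproduce figures obtained from the exact calculations.

```python
import math


def diffraction_limit(wavelength: float, aperture: float) -> float:
    """Aperture-diffraction angle lambda / D in radians (same length units)."""
    return wavelength / aperture


def blur_angle(wavelength: float, length: float, theta_i_deg: float, n_o: float) -> float:
    """Internal blur angle in radians from equation (5) as read above."""
    t = math.radians(theta_i_deg)
    geometry = math.sqrt(math.sin(t) ** 4 + math.sin(2 * t) ** 2) / math.sin(t) ** 2
    return 1.8 * math.pi * wavelength * geometry / (n_o * length)


# 10 micrometer light through a 1 cm aperture: about 1 milliradian.
limit = diffraction_limit(10e-6, 1e-2)
```

Doubling `length` halves the value returned by `blur_angle`, which is the scaling that makes long interaction lengths attractive for imaging.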

Effects of image blur are more prominent in the
visible and near IR using TeO2 crystals, and great
care is needed in designing imaging AOTFs to
minimize blur effects. The two critical effects are
due to bandpass, as described above, and image
shift due to dispersion. Analysis and measurements
were carried out at CMU [Wachman, 1996] on a
TeO2 AOTF custom built for this purpose by
NEOS. Transducer length could be varied by external
interconnections. The basic AOTF design
parameters for this AOTF are listed in Table 1.

TABLE 1. AOTF-1 Parameters

θ_i = 12°
Δθ = 6.5° (ext)
θ_a = 5.9°
L1 = 0.33 cm
L2 = 0.66 cm
L3 = 1.32 cm
L4 = 2.32 cm
Δλ/λ = 0.1 for L = 2.32 cm

The transducer structure consisted of four elements,
each wired to a connector on the mount so
that it could be driven independently. Analysis and
measurements on image blur were carried out, and
the  results  are  summarized  in  Figure  6  for  the  four 
values  of  transducer  length.  The  analysis  indicates 
that  a  transducer  length  of  about  3.5cm  is  required 
to  reduce  the  image  blur  to  no  greater  than  that  due 
to  aperture  diffraction  from  a  1  cm  aperture,  at  0.7 
micrometers  wavelength.  This  calculation  is  in 
good  agreement  with  the  experimental  results  in 


Figure  6,  which  shows  spectrally  filtered  photo¬ 
graphs  of  a  resolution  chart  taken  with  the  four 
transducer  lengths.  The  horizontal  bars,  unaffected 
by  AO  diffraction,  are  near  diffraction  limited  reso¬ 
lution.  For  L  =  2.32  cm  the  analysis  predicts  the 
AO  blur  to  be  about  1.5x  the  diffraction  limit, 
while  it  is  about  6x  for  L  =  0.33  cm.  This  AOTF 
design  is  a  reasonably  good  match  between  the 
spectral  and  spatial  resolution  characteristics. 


Figure  6:  Target  images  taken  with  various 


3.2.  Background  Illumination 

Another  major  source  of  AOTF  image  degradation 
is  the  loss  of  contrast  due  to  high  background  lev¬ 
els.  This  background  is  largely  out-of-band  wave¬ 
lengths,  and  is  principally  due  to  three  causes:  high 
sideband  levels,  light  diffracted  by  acoustic  energy 
reflected  at  various  crystal  surfaces,  and  light  scat¬ 
tered  by  the  crystal  from  its  bulk,  surfaces,  and 
coatings.  Measurements  of  this  spectrally  broad¬ 
band  background  were  made  on  the  AOTF 
described  above,  and  the  results  are  summarized  in 
Figure  7. 

The AOTF was tuned for peak transmission at 5
wavelengths across its range: 470 nm, 486 nm, 535
nm, 650 nm and 710 nm. A spectrometer was used
to measure the intensity from 400 to 770 nm for
each of these tuning conditions. Close to the main
peak, high sidelobes are the principal cause of the
background, while further from the main peak
phase matching to reflected acoustic energy may be
dominant. The latter contribution was measured by
using pulsed RF power and gating the detector so
as to discriminate against the diffracted light due to
the first transit acoustic wave. The background due
to this cause is greater than -20 dB from about 450
to 650 nm. For imaging applications for which a
dynamic range of more than 30 dB is needed, it is
clear that steps must be taken to greatly reduce the
background from these effects through proper
AOTF design. We addressed the background issues
with a TeO2 AOTF imaging system based on a
design described in Table 2.

Figure 7: Spectral intensity distribution
for several AOTF tuned wavelengths.

TABLE 2. AOTF-2 Parameters

θ_i = 10.7°
Δθ_i = 6.5° (ext)
θ_a = 4.0°
θ_face wedge = 3°
θ_blur ≈ 0.5 rad (ext), 3x diffraction limit
L = 1.43 cm (3 elements)
Δλ/λ = 0.1

We have found that a major cause of image degradation
relates to the transducer structure and
method of interconnecting elements. In order to
achieve good impedance matching for a large area
transducer, it is necessary to either interconnect
several small elements in series, or drive each
independently from a power splitter. In the latter case,
the elements may have either a common ground
plane or isolated grounds. For series connection a
single matching circuit is used, and for parallel
connection each element has its own matching circuit.
The imaging results we obtained using multiple
elements generally show that there are
distinct, multiple images, and that the blur for each
image corresponds not to the entire transducer length,
but only to the element length. This behavior suggests
that the acoustic wave components generated at
each element are not coherent, and possibly not
co-directional. We have found that with one
parallel-connected AOTF, for which each element has its own
matching circuit, multiple images resulted from
cross-talk between the matching circuits due to
inadequate isolation. In each case, we found, as
expected, a strong correlation between image quality
and how closely the passband of the AOTF
adheres to the sinc² function. One such passband
characteristic is illustrated in Figure 8; note the
significant departure from a sinc² function.


Figure 8: Spectral resolution of NEOS 4-3-P-2
AOTF (horizontal axis: frequency, MHz).

Figure 9: Spectral resolution of NEOS 4-3-P-1 and
4-3-S-1 in series (horizontal axis: frequency, MHz).

It is important to realize that even in the case of
perfect passband behavior, the image will be corrupted
by the sidelobes, the first of which is only
13 dB below the main signal. This level is inadequate
for many imaging applications. Attempts to
address this problem by apodization of the acoustic
field have not been particularly successful because
of the transducer fabrication difficulties in achieving
a good enough approximation to the required
field profile. We have demonstrated an alternative
approach to sidelobe reduction which is based on
the use of two AOTF’s in series, for which the
passband characteristic is sinc⁴. The expected
transmission for two identical filters in series is a
reduction of the first sidelobes to -26 dB, and a
reduction in the FWHM of the main lobe to
about 75% of that of the single filter. A measurement of
the passband was made for two AOTF’s in series,
and the results are shown in Figure 9. The first
sidelobe is about 25 dB below the main peak, in
good agreement with theory.
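
The sidelobe and bandwidth figures quoted above follow directly from the ideal passband shapes. The sketch below assumes an ideal sinc² passband for one filter and its square (sinc⁴) for two identical filters in series; the function names and the normalized detuning variable x are illustrative and not taken from the measurements.

```python
import math


def sinc(x: float) -> float:
    """Normalized sinc, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def passband(x: float, filters: int = 1) -> float:
    """Ideal transmission of `filters` identical AOTFs in series."""
    return sinc(x) ** (2 * filters)


def first_sidelobe_db(filters: int = 1, samples: int = 20000) -> float:
    """Peak transmission between the first and second nulls (1 < x < 2), in dB."""
    peak = max(passband(1 + k / samples, filters) for k in range(1, samples))
    return 10 * math.log10(peak)


def fwhm(filters: int = 1) -> float:
    """Full width at half maximum of the main lobe, found by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if passband(mid, filters) > 0.5:
            lo = mid
        else:
            hi = mid
    return 2 * lo


single_db = first_sidelobe_db(1)
series_db = first_sidelobe_db(2)
width_ratio = fwhm(2) / fwhm(1)
print(f"single filter first sidelobe: {single_db:.1f} dB")
print(f"two in series first sidelobe: {series_db:.1f} dB")
print(f"FWHM ratio (series / single): {width_ratio:.2f}")
```

For the ideal shapes this gives roughly -13.3 dB and -26.5 dB, with the main lobe narrowed to about 72% of the single-filter width, consistent with the rounded values in the text.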


Measurements  were  made  of  the  white  light  spec¬ 
tral  imaging  characteristics  of  the  same  two 
AOTF’s,  individually  and  with  both  in  series  in  the 
optical  train.  The  target  consisted  of  a  cross  cut 
into  sheet  metal,  and  illuminated  from  behind.  The 
width  of  the  cutout  slit  was  3  mm,  and  the  target 
was  placed  a  distance  of  2.5  meters  from  the  cam¬ 
era;  therefore  the  slit  subtended  an  angle  of  1.2 
milliradians  at  the  input  optics.  Figure  10  shows 
the  image  of  the  target,  and  the  intensity  traces  of 
the  slit  perpendicular  to  the  AO  diffraction  plane, 
and  parallel  to  the  diffraction  plane.  The  image  of 
the  vertical  slit  is  corrupted  by  the  blur  and  side¬ 
lobe  effects,  and  this  is  reflected  in  the  trace.  In 
Figure 11 we show the image taken with the two
AOTF’s  in  series,  in  which  the  sidelobe  images  are 
no  longer  visible.  We  also  show  the  intensity  trace 
for  the  vertical  slit,  superposed  on  the  trace  for  the 
single  AOTF  to  show  the  same  effect. 


Figure 10: Image recorded with NEOS 4-3-S-1
AOTF, and intensity scans parallel
and perpendicular to diffraction direction.

Figure 11: Image recorded with two AOTFs in
series, and intensity scan compared with single


3.3. Scattering

A contribution to broad spectral background illumination
limiting dynamic range will be caused by
scattering of the incident white light image from
the AOTF crystal volume, and its surfaces. This is
not normally a problem, except for very high
dynamic range systems, or very high resolution filters.
A rough estimate of this limit can be made
from existing scattering data, which suggests that
for the best optical quality oxide crystals such as
TeO2, forward scattering values are about -50 dB;
the scattering will be greater for most infrared AO
crystals, such as TAS or Hg2Cl2. Assuming a
cos² intensity dependence with respect to the forward
direction, scattering into the diffracted image
acceptance aperture is nearly the same value for
typical AOTF designs. The ratio of scattered light
intensity to diffracted image signal is approximately
where S = scattering coefficient, φ = scattering
angle, δλ = AOTF resolution, Δλ = spectral range
of light source and detector, p = polarization loss,
at least 50%, and η = AOTF efficiency.

For the AOTF design of Table 2, S ≈ 10⁻⁵, (Δλ/δλ) =
100, p = 0.5, and η = 0.5, so that the estimated
scattered  light  intensity  is  about  24  dB  below  the 
image  signal.  Measurements  were  made  of  the 
scattering  level  from  the  experimental  AOTF’s 
using  a  He-Ne  laser  as  source,  and  scanning  the 
detector  in  the  image  plane.  These  results  are  in 
reasonably  good  agreement  with  this  estimate. 
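
The 24 dB figure can be reproduced from the quoted parameters. The sketch below is an assumption-laden check, not the authors' formula: the expression for the ratio is not reproduced in this text, so the form S·cos²φ·(Δλ/δλ)/(p·η) is assumed here because it is consistent with the listed variables and with the stated result. S = 10⁻⁵ corresponds to the -50 dB forward scattering value, and cos²φ is taken as 1.

```python
import math


def scatter_to_signal_db(S: float, cos2_phi: float,
                         range_over_resolution: float,
                         p: float, eta: float) -> float:
    """Scattered-to-diffracted intensity ratio in dB.

    Assumed form (not given explicitly in the text):
        ratio = S * cos^2(phi) * (delta_lambda_range / resolution) / (p * eta)
    """
    ratio = S * cos2_phi * range_over_resolution / (p * eta)
    return 10 * math.log10(ratio)


# Parameter values quoted for the Table 2 design; cos^2(phi) ~ 1 is an assumption.
estimate_db = scatter_to_signal_db(S=1e-5, cos2_phi=1.0,
                                   range_over_resolution=100.0,
                                   p=0.5, eta=0.5)
print(f"estimated scattered light level: {estimate_db:.1f} dB relative to signal")
```

Under these assumptions the ratio is 4 x 10⁻³, or about -24 dB, matching the estimate in the text.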

4.  IR  Crystal  Growth  Activities 

For AOTF applications in the visible to far (to 20
µm) IR range, mercurous chloride has very attractive
properties [Barta, 1970]. However, it is not
available commercially. Nevertheless, there is
considerable prior knowledge in the literature originating
from C. Barta of Czechoslovakia [Barta et al.,
1975], and more recently from a Westinghouse
Research team, led by N.B. Singh [Singh et al.,
1986], about the seeded vapor growth of mercurous
chloride. Our goal is to establish the equipment
and the hands-on skills needed to grow mercurous
chloride. During 1996, we grew two crystals from
commercial mercurous chloride powder samples,
one from Aldrich and the second from Fisher.
Significant improvement of crystal formation was
made from the first to the second growth, primarily
through the acquisition of a new turbo pump vacuum
system. We used a two zone furnace for this
growth. Its actual temperature profile for mercurous
chloride growth is shown in Figure 12. Fine
adjustment is required to initiate seeding either at a
random site (unseeded growth) or on a suitably oriented
pre-existing crystal (seeded growth). Only
the crystal grown from Fisher material has the
appearance of a single crystal. Both crystals are
awaiting X-ray characterization.

Figure 12: The temperature profile of the crystal
growth furnace.

5.  Future  Plans 

The  AOTF  that  we  designed  for  optimized  broad 
spectral  band  operation  has  been  built  to  our  speci¬ 
fications  and  delivered.  This  device  is  being  char¬ 
acterized  as  to  its  electrical  and  acousto-optical 
parameters  to  fully  exploit  its  capabilities  in  smart 
filter  scenarios.  We  will  address  the  important  issue 
of  minimizing  spectral  image  blur  with  a  new 
device  configuration  in  which  two  AOTF’s  are  used 
in  tandem  to  reduce  the  impact  of  spectral  side 
lobes.  We  will  then  proceed  with  the  major  task 
towards  combining  the  operation  of  the  AOTF  with 
computational  sensors.  In  the  crystal  growth  activ¬ 
ities,  we  plan  to  establish  seeded  growth  of  mercu¬ 
rous  chloride.  Key  tasks  include  setting  up  for 
crystal orienting, cutting, and polishing.

Abstract

Our  Center  is  carrying  out  a  coordinated  pro¬ 
gram  to  develop  general-purpose  autonomous 
systems  for  vision,  object  recognition,  and  con¬ 
trol  applications.  The  systems  are  realized 
in software, off-the-shelf hardware, and customized
chips. These systems are designed to
operate  within  noisy  environments  for  which 
rules  are  not  known  and  which  can  change  un¬ 
expectedly  through  time.  They  typically  be¬ 
gin  with  models  of  a  key  brain  competence  and 
end  with  fielded  applications  that  have  been 
thoroughly  benchmarked.  Projects  include  psy¬ 
chophysical  studies  of  how  humans  search  com¬ 
plex  scenes;  models  of  coherent  processing  of 
noisy  and  incomplete  image  data  from  natu¬ 
ral  and  artificial  sensors;  development  of  self¬ 
organizing  classifiers  capable  of  fast,  stable,  dis¬ 
tributed,  incremental  learning  and  hypothesis 
testing  in  response  to  nonstationary,  incom¬ 
plete,  and  probabilistic  data;  algorithm  and 
hardware  development  for  head-mounted  space- 
variant  active  vision  systems;  development  of 
self-calibrating  autonomous  robots;  and  fabri¬ 
cation  of  chips  for  vision  and  classification  ap¬ 
plications. 

1  Introduction 

This report summarizes research being conducted
under the Multidisciplinary University
Research Initiative (MURI) program by the
Boston  University  Department  of  Cognitive  and 
Neural  Systems,  the  Boston  University  College 
of  Engineering,  and  the  Johns  Hopkins  Uni¬ 
versity  Department  of  Electrical  and  Computer 
Engineering.  Our  MURI  Center  is  develop¬ 
ing  general-purpose  autonomous  neural  systems 
for  vision,  object  recognition  and  control  ap¬ 
plications.  These  systems  are  designed  to  op¬ 
erate  within  the  types  of  uncontrolled  envi¬ 
ronments  that  are  typified  by  the  battlefield. 
Such  environments  may  contain  rare  but  im¬ 
portant  events  whose  consequences  differ  from 
those  of  similar  frequent  events,  as  well  as  un¬ 
expected  combinations  of  events,  irregular  sta¬ 
tistical  drifts  in  event  sequences,  and  different 
amounts  of  morphological  variability  in  objects 
to  be  detected,  recognized,  and  controlled. 

Our  projects  typically  begin  by  modelling  a  key 
biological  competence.  These  models  aspire  to 
be  general-purpose  solutions  of  modal  problems, 
such  as  vision,  adaptive  pattern  recognition, 
and  adaptive  sensory-motor  control.  We  are 
guided  in  our  model  development  by  the  typ¬ 
ically  huge  psychophysical  and  neurobiological 
data  bases  in  these  fields.  These  data  bases  are 
the  first  explanatory  and  predictive  targets  of 
the  model.  For  example,  we  developed  a  new 
model  of  how  the  visual  cortex  is  organized  into 
layers,  columns,  maps,  networks,  and  successive 
processing  levels  to  generate  context-sensitive 
boundary  segmentations  of  image  data.  Where 
sufficient data are not available and the competence
is important, we collect and analyse the
data ourselves. Our mathematical and computational
analyses then characterize each model’s
functional  properties.  Once  they  are  mathe¬ 
matically  understood,  the  models  can  be  mod¬ 
ified  and  optimized  for  a  wide  variety  of  appli¬ 
cations.  Earlier  versions  of  these  models  have 
rapidly  been  applied  by  technologists  because 
they  exhibit  human-compatible  properties  of 
autonomous  adaptation  or  performance  in  re¬ 
sponse  to  various  types  of  changing  environmen¬ 
tal  conditions. 

In  order  to  optimize  the  models  for  applica¬ 
tions,  they  are  tested  by  being  benchmarked 
against  competing  approaches.  Often  there 
are  no  competing  approaches,  because  vari¬ 
ous  of  the  models  have  combinations  of  de¬ 
sirable  self-organizing  properties  that  have  not 
yet  been  achieved  elsewhere.  For  example, 
the  Distributed  ARTMAP  algorithm  summa¬ 
rized  herein  joins  properties  of  fast,  stable,  dis¬ 
tributed,  and  incremental  learning  and  hypoth¬ 
esis  testing  for  classification  of  arbitrarily  large 
amounts  of  nonstationary  data.  This  combi¬ 
nation  of  properties  has  not  elsewhere  been 
achieved,  to  the  best  of  our  knowledge.  After 
the  models  are  successfully  benchmarked,  they 
are  then  realized  in  real-time  software,  off-the- 
shelf  hardware,  or  custom  VLSI  chips. 

Illustrative  applications  that  are  reviewed  here 
include  boundary  and  surface  processing  of  nat¬ 
ural  and  synthetic  images,  development  of  self¬ 
organizing  classifiers  of  textured  synthetic  aper¬ 
ture  radar  (SAR)  images,  geospatial  mapping 
from  satellite  remote  sensing  data,  medical  pre¬ 
diction  in  the  field,  radar  range  profile  target 
recognition,  automatic  generation  of  coherent 
and  attentive  representations  of  object  motion, 
continuous  motion  tracking  in  response  to  spa¬ 
tially  and  temporally  discontinuous  signals,  fu¬ 
sion  of  form  and  motion  data  to  predict  object 
motion  in  noisy  environments  that  could  not  be 
achieved  using  motion  data  only,  psychophysical 
experiments  to  determine  how  human  observers 
search  complex  scenes,  adaptive  multimodal  fu¬ 
sion  of  visual,  auditory,  and  planned  move¬ 
ment  commands  for  attentive  control  of  ballis¬ 
tic  movements,  algorithmic  and  hardware  devel¬ 
opment  of  head  mounted  space-variant  active 
vision systems, navigation by self-calibrating
robots under visual guidance, and the development
of neuromorphic VLSI for vision and
adaptive  classification  applications. 

2  Boundary  and  Surface  Processing 
of  Natural  and  Synthetic  Images 

2.1  Boundary  Segmentation  and 
Surface  Representation 

Automatic  boundary  segmentation  and  surface 
reconstruction  of  noisy  and  cluttered  scenes  re¬ 
main  key  problems  for  applications.  Such  pre¬ 
processing  is  needed  for  use  by  expert  photoin¬ 
terpreters,  by  non-expert  users  of  SAR,  multi- 
spectral  infrared  (IR),  and  laser  detection  and 
ranging  (LADAR)  imagery  in  battlefield  condi¬ 
tions,  and  as  a  preprocessor  for  such  image  data 
before  it  is  automatically  classified  by  adaptive 
pattern  recognition  algorithms.  Nonparametric 
methods  that  can  cope  with  arbitrary  images 
are  needed  to  deal  with  many  battlefield  situ¬ 
ations.  Our  approach  to  boundary  segmenta¬ 
tion  provides  nonparametric  multiple-scale  data 
fusion  and  parallel  decision-making  algorithms 
whose  design  principles  and  circuit  mechanisms 
can  be  generalized  to  many  problems. 

This  project  continues  the  development  of 
boundary  segmentation  circuits  with  the  fol¬ 
lowing  properties:  (1)  automatic  compensation 
for  variable  illumination  gradients  (“discount 
the  illuminant”),  (2)  suppression  of  noise  in 
a  context-sensitive  fashion,  (3)  completion  of 
boundary  groupings  in  response  to  mixtures  of 
noisy edges, textures, and shading, and (4) completion
of noise-free surface representations using
the outputs of (1)-(3). The resulting algorithm
has been tested, for example, on SAR
imagery  of  wooded  scenes  with  man-made  roads 
from  MIT  Lincoln  Laboratory.  The  SAR  images 
were  obtained  using  a  35-GHz  synthetic  aper¬ 
ture  radar  with  1  foot  by  1  foot  resolution  and  a 
slant  range  of  7  km  [Novak  et  al.,  1990].  Ear¬ 
lier  versions  of  the  algorithm  have  been  used  by 
Lincoln  Lab,  among  others,  to  preprocess  SAR, 
multispectral  IR,  and  LADAR  images,  and  sim¬ 
ilar  technology  transfers  are  anticipated  for  the 
new  algorithm.  The  new  circuits  are  simpler 
computationally, run faster, and provide better
noise suppression, boundary localization, and
boundary  completion  properties.  These  circuits 
form  the  backbone  of  a  larger  circuit  under  de¬ 
velopment  which  also  (5)  separates  objects  from 
each  other  and  from  their  backgrounds  in  3-D, 
and  (6)  completes  representations  of  partially 
occluded  objects  in  response  to  both  3-D  scenes 
and  2-D  pictures  (see  Section  4).  These  bound¬ 
ary  and  surface  circuits  form  the  front  end  of 
architectures  that  include  self-organizing  pat¬ 
tern  recognition  algorithms  for  incrementally 
learning  to  classify  scenic  objects,  textures,  and 
image  understanding  interpretations  (see  Sec¬ 
tions  3  and  5). 

2.2 Boundary Segmentation: From
Visual Cortex to IU Algorithm

Sources for our new IU designs are the huge
experimental literatures on human visual psychophysics
and primate neuroscience. Understanding
how the visual cortex is organized to
yield the properties of visual perception is one
of the outstanding questions in the science and
technology of vision. The visual cortex is organized
into layers, columns, maps, networks,
and successive processing stages. One project
is developing a computational model of how all
these structures are organized for purposes of
boundary segmentation in the first three processing
stages (lateral geniculate nucleus (LGN)
and interblob cortical areas V1 and V2) beyond
the photosensitive retina. This model suggests
how the visual cortex elegantly accomplishes
the image processing goals (1)-(3) stated
above [Grossberg, Mingolla, and Ross, 1997] using
compact and modular circuits. These circuits
provide functional explanations for identified
cortical cells and connections, and have simulated
challenging psychophysical and neurophysiological
data. The circuits have also been
used to successfully process SAR data, and will
be incorporated into the next generation of vision
chips from our Center’s VLSI (very large
scale integrated circuit) team.

A schematic circuit diagram of this FACADE
network is given in Figure 1. Key design features
are summarized because of the breakthrough
nature of these results. Neural labels
are used for definiteness: (1) The circuit retains
analog sensitivity to distributed image features,
even as it generates coherent and context-sensitive
boundary segmentations. (2) Several
types of feedback circuits, which equilibrate
very quickly (in 1-3 cycles), ensure this analog
sensitivity. (3) One type of feedback circuit
operates within each brain region. It generates
the groupings whereby edge, texture, shading,
and stereo information are bound together into
coherent segmentations. Such circuits use direct
long-range horizontal cooperation and two-stage
(i.e., disynaptic) short-range competition
within the complex cells of cortical layer 3 to
realize a bipole property (see Figure 2A). The
bipole property enables a complex cell to fire
when it lies between nearly aligned inducing
signals, but not when it lies beyond a single
inducer. (4) Another type of cooperative and
competitive interaction works with the bipole
property to generate context-sensitive boundary
groupings. Excitatory inputs from LGN arrive
in area V1 at layers 4 and 6 (Figure 2B). LGN
inputs directly excite orientationally tuned simple
cells in layer 4. In particular, oriented arrays
of spatially displaced LGN ON and OFF cells
excite mutually inhibitory simple cells that are
sensitive to the same orientation but opposite
contrast polarities (not shown in Figure 2). After
layer 6 cells are activated, they, in turn, both
excite and inhibit layer 4. The net effect is that
LGN influences layer 4 via a feedforward on-center
off-surround network of cells (Figure 2B)
that obey membrane, or shunting, equations.
This excitatory-inhibitory balance enables layer
4 simple cells to maintain their analog sensitivity
to visual inputs of variable contrast.

Figure 1: Schematic of LGN-V1-V2 model
circuitry. The V2 circuit replicates the V1 circuit
but at a larger spatial scale. Open symbols
= excitatory, closed symbols = inhibitory.

Figure 2: Model retinal, V1, and LGN circuit.
See text for details.

The  next  interactions  close  the  feedback  loop 
that  accomplishes  boundary  segmentation:  (6) 
Layer  4  cells  activate  cells  in  layer  3  (Figure  2B), 
which  then  attempt  to  cooperate  using  their 
long-range  horizontal  connections  and  short- 
range  disynaptic  inhibition.  All  activated  cells 
feed  back  to  layer  6  (Figure  2C).  Layer  3  hereby 
gains  access  to  the  on-center  off-surround  net¬ 
work  of  connections  from  layer  6  to  layer  4. 

(7) The long-range bipole grouping in layer 3
can use the shorter-range layer 6-to-4 signals
to amplify those cell activations that are favored
by bipole grouping and suppress those
that are not, while maintaining analog sensitivity.
Layer 6-to-4 inhibition influences different
orientations  and  positions  by  being  distributed 
across  a  data  structure  (called  a  hypercolumn 
map)  wherein  cells  sensitive  to  these  features  are 
spatially  organized.  Using  this  data  structure, 
the  short-range  competition  can  relatively  en¬ 
hance  cell  responses  cooperating  in  positional, 
orientational,  and  length-sensitive  groupings  by 
suppressing  cells  responding  to  weaker  group¬ 
ings,  incoherent  noise,  or  background  signals. 
Thus,  the  same  circuit  that  maintains  sensitiv¬ 
ity  to  bottom-up  inputs  also  does  so  in  response 
to  top-down  signals,  and  in  so  doing  closes  the 
fast  feedback  loop  that  generates  boundary  seg¬ 
mentations. 

(8) The second type of feedback circuit operates
between successive brain regions. For example,
after LGN activates layers 4 and 6 in
area V1, feedback from layer 6 in area V1 to
LGN selects and synchronizes LGN activities
that are consistent with cortical cell activity,
and suppresses all other LGN cells. This feedback
quickly focuses attention upon the selected
LGN cells and increases the visual information
transmitted from LGN to cortex. Thus, the
same feedback signals from layer 3 to 6 that
start to generate segmentations in V1 also selectively
enhance the LGN inputs that will support
the segmentation process. We have proposed
that similar attentional circuits operate
at all levels of the visual cortex, including areas
V2 and V4, in order to explain neurophysiological
data from these areas. We also predict that
these top-down attentional circuits stabilize the
learning process whereby bottom-up adaptive
filters autonomously tune their parameters in
response to changing input statistics.

(9) Another efficient design links areas V1 and
V2. The model proposes that V1 and V2 are
self-similar copies of one another, but that V2
has longer-range interactions than V1. Various
data support this hypothesis. Groupings across
the smaller scales in V1 enhance responses that
code for mutually consistent boundary locations
and orientations, while larger-scale groupings in
V2 support long-range boundary completion
across textures and large pixel drop-outs, which
in the brain are due to the blind spot and veins
of the retina. Layer 3 in V1 activates layers 4
and 6 in V2, much as LGN activates V1.
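
The bipole property described in feature (3) can be illustrated with a deliberately simplified one-dimensional sketch. It is not the model circuit: the function name `bipole_complete` and the `reach` parameter are hypothetical, activities are binary, and orientation tuning and the layer 6-to-4 competition are omitted. A cell is kept active if it has direct input, or switched on if inducers lie within reach on both sides of it.

```python
def bipole_complete(signals: list[int], reach: int = 3) -> list[int]:
    """Return a boundary row after one pass of a toy bipole rule.

    A silent cell becomes active only when both of its flanks (up to `reach`
    cells on each side) contain at least one inducing signal, so gaps between
    inducers are filled but boundaries do not grow beyond a single inducer.
    """
    completed = []
    for i, own_input in enumerate(signals):
        left_support = sum(signals[max(0, i - reach):i])
        right_support = sum(signals[i + 1:i + 1 + reach])
        fires = own_input > 0 or (left_support > 0 and right_support > 0)
        completed.append(1 if fires else 0)
    return completed


# Two inducers separated by a two-cell gap, followed by empty space.
row = [1, 1, 0, 0, 1, 1, 0, 0, 0]
completed_row = bipole_complete(row)
print(completed_row)
```

In this toy example the gap between the two inducers is completed, while the cells to the right of the last inducer stay silent because they are supported from one side only.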

2.3  Enhancement  of  Synthetic 
Aperture  Radar  Images 

The  boundary  segmentation  system  is  part 
of  a  Boundary  Contour  System  (BCS).  The 
BCS  interacts  with  a  surface  representation  sys¬ 
tem  that  is  called  the  Feature  Contour  Sys¬ 
tem  (FCS).  Both  the  BCS  and  the  FCS  re¬ 
ceive  input  signals  only  after  the  illuminant  has 
been  discounted  by  a  shunting  on-center  off- 
surround  network.  BCS  boundaries  block  the 
diffusion of discounted inputs to FCS compartments.
This version of anisotropic diffusion [Cohen
and Grossberg, 1984; Grossberg and Todorovic,
1988] can achieve more precise surface representations
than schemes which do not fully exploit
the fact that boundaries and surfaces are
formed using complementary processing rules
[Grossberg, Mingolla, and Todorovic, 1989]. In
particular,  FCS  diffusion  is  restricted  not  only 
by  the  presence  of  high  intensity  gradient  val¬ 
ues  in  the  original  image,  but  also  by  long-range 
colinear  (or  nearly  colinear)  groupings  of  im¬ 
age  gradient  signals.  These  coherent  boundaries 
determine  which  fluctuations  in  pixel  intensity 
will  be  defined  as  noise  and  smoothed  over,  and 
which  will  be  enhanced  as  image  structures. 
This  coherence  property  becomes  especially  im¬ 
portant  in  the  presence  of  the  high  noise,  pixel 
drop  out,  and  incomplete  image  data  that  may 
characterize  battlefield  conditions. 
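
The idea that boundaries block diffusion between surface compartments can be sketched in one dimension. This is an illustrative simplification, not the FCS equations: the function name, the `rate` and `steps` parameters, and the use of plain linear diffusion are assumptions made for the example. Each entry of `boundaries` is a boundary strength between neighboring cells, where 1 blocks diffusion completely and 0 lets it pass.

```python
def gated_diffusion(values: list[float], boundaries: list[float],
                    steps: int = 200, rate: float = 0.25) -> list[float]:
    """Diffuse `values` along a line, with flow gated by boundary strength.

    boundaries[i] in [0, 1] sits between cell i and cell i + 1.
    """
    v = list(values)
    for _ in range(steps):
        # Compute every flux from the current state before updating any cell.
        flux = [rate * (1.0 - boundaries[i]) * (v[i + 1] - v[i])
                for i in range(len(v) - 1)]
        for i, f in enumerate(flux):
            v[i] += f
            v[i + 1] -= f
    return v


# Noisy samples from a dark region (mean 2) and a bright region (mean 8),
# separated by one strong boundary between cells 2 and 3.
noisy = [1.0, 3.0, 2.0, 9.0, 7.0, 8.0]
edges = [0.0, 0.0, 1.0, 0.0, 0.0]
smoothed = gated_diffusion(noisy, edges)
print([round(x, 3) for x in smoothed])
```

Within each compartment the fluctuations are averaged away, while the contrast across the boundary is preserved; without the boundary the whole row would relax to a single mean value.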

Our first application of the new BCS/FCS
model was to enhance images of range data
from a SAR sensor. The model was used to
make structures such as motor vehicles, roads,
and buildings more salient to human observers
than they are in the original imagery, thereby
making SAR imagery, and related types of images,
useful in battlefield situations to individuals
without extensive training as photointerpreters.
The shunting network performs a local
normalization of dynamic range in one processing
step. The new BCS algorithm generates
positionally  more  accurate  boundaries  with  sig¬ 
nificant  reductions  in  processing  time  and  algo¬ 
rithmic  complexity  over  previous  BCS  circuits. 
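
The local normalization performed by a shunting on-center off-surround network can be illustrated by its steady state. The sketch assumes the standard textbook form of the feedforward shunting equation, dx_i/dt = -A x_i + (B - x_i) I_i - (x_i + D) Σ_{k≠i} I_k, with every other cell acting as surround; the parameter names A, B, and D and the all-to-all surround are assumptions for the example, not details taken from this report.

```python
def shunting_equilibrium(inputs: list[float], A: float = 1.0,
                         B: float = 1.0, D: float = 0.0) -> list[float]:
    """Steady-state activities of a feedforward shunting network.

    Setting dx_i/dt = 0 in the assumed equation gives
        x_i = (B * I_i - D * sum_{k != i} I_k) / (A + sum_k I_k).
    """
    total = sum(inputs)
    return [(B * I - D * (total - I)) / (A + total) for I in inputs]


# The same reflectance pattern under dim and 100x brighter illumination.
dim = shunting_equilibrium([1.0, 2.0, 1.0])
bright = shunting_equilibrium([100.0, 200.0, 100.0])
print("dim:   ", [round(x, 3) for x in dim])
print("bright:", [round(x, 3) for x in bright])
```

The relative pattern of activity is the same in both cases and the total activity stays below B, which is the sense in which the network compresses dynamic range and discounts the illuminant in a single step.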


The  BCS-gated  diffusion  process  in  the  FCS  re¬ 
duces  speckle  noise,  smoothes  image  brightness 
in  a  form-sensitive  way,  and  enhances  contrast- 
differences  between  different  image  surfaces  (see 
Figure 3). The BCS/FCS algorithm outperforms
alternative published algorithms for image
enhancement, as detailed for an earlier
version of the algorithm in Grossberg,
Mingolla, and Williamson [1995]. For example,
the BCS/FCS algorithm typically converges to
a stable image reconstruction in 1-3
iterations. In contrast, the correct number
of iterations of median, sigma, and geometric
filters is image-dependent. The new BCS
algorithm runs in approximately 20% of the
time required for the published algorithm of
Grossberg, Mingolla, and Williamson [1995] on
comparable hardware. Dr. Allen Waxman’s group
had previously reported processing times of
approximately 50 seconds on DEC Alpha
stations for multiscale processing of large
(512 × 512) SAR images, using a simplified and
optimized form of our earlier algorithms
[Waxman et al., 1993]. Processing times of about
10  seconds  per  image  for  personal  workstations 
thus  appear  realistic  for  our  present  algorithm 
with  a  comparable  optimization  effort.  Earlier 
versions  of  this  technology  have  been  applied 
and  developed  by  Waxman’s  group  for  military 
applications.  His  work  is  now  funded  under  the 
Integrated  Imagers  Initiative  of  DARPA/ETO. 
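
The idea of boundary-gated diffusion described above can be sketched in one dimension: brightness spreads freely between neighboring cells unless a boundary signal sits between them, in which case the flow is almost blocked. This is only an illustrative sketch. The function name, the explicit update scheme, and the `rate` and `gain` parameters are assumptions, and the published FCS equations differ in detail.

```python
def gated_diffusion(signal, edge_boundaries, steps=500, rate=0.25, gain=1e6):
    """Diffuse `signal` along a 1-D chain of cells.

    edge_boundaries[i] is the boundary strength between cell i and cell i+1
    (so it has len(signal) - 1 entries). Permeability falls toward zero as
    the boundary strength grows, so smoothing stays within each region.
    """
    if len(edge_boundaries) != len(signal) - 1:
        raise ValueError("need one boundary value per pair of neighbors")
    values = list(signal)
    permeability = [1.0 / (1.0 + gain * b) for b in edge_boundaries]
    for _ in range(steps):
        updated = values[:]
        for i, p in enumerate(permeability):
            flow = rate * p * (values[i + 1] - values[i])
            updated[i] += flow       # cell i gains what cell i+1 loses
            updated[i + 1] -= flow
        values = updated
    return values
```

For example, a noisy dark region next to a noisy bright region, separated by a single boundary, relaxes to two nearly flat plateaus at the regional means. Noise is averaged out within each region while the contrast between the regions is preserved.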

3  Gaussian  ARTMAP  and  ARTEX 
Classifiers 

3.1  Gaussian  ARTMAP  versus 
Expectation  Maximization 
Classifiers 

The BCS boundary and FCS surface
representations are often used to generate
output vectors that are input to a
self-organizing pattern categorizer, or
classifier. Such a system can autonomously
classify data from multiple sensor types, or
can aid a human observer in his/her
classification performance. Many projects
within our Center are developing Adaptive
Resonance Theory (ART) classifiers, because
they combine a series of properties that are
of importance in battlefield situations.
These include the ability to incrementally
learn stable recognition categories in
response to an essentially unlimited number
of rare events, unexpected events, irregular
statistical drifts in event sequences, and
different amounts of morphological
variability in the events to be classified.
Present research is focusing on how to
simultaneously realize the constraints of
fast, stable, and distributed incremental
learning in response to an arbitrary
nonstationary environment of arbitrary finite
size. All other algorithms known to us fail
on one or more of these criteria. See
Section 5 for further details.

Figure 3: (A) Top left: Unprocessed SAR image
of upstate New York scene consisting of
highway with bridge overpass. (B) Top right:
Logarithm-transformed SAR image. (C) Bottom
left: Center-surround, shunting network
result averaged across spatial scales.
(D) Bottom right: New BCS/FCS multi-scale
enhancement.

The present project developed a new ART
classifier with which to process textured SAR
scenes after they are preprocessed by the
BCS/FCS model. This classifier is called
Gaussian ARTMAP, or GAM. We subsequently
showed that this classifier outperforms
state-of-the-art statistical, rule-based, and
multilayer perceptron algorithms on a number
of other benchmark databases, including:
letter image recognition [Frey and Slate,
1991], Landsat satellite image segmentation
[Feng et al., 1993], speaker-independent
vowel recognition [Deterding, 1989], and
natural texture databases [Brodatz, 1966], in
addition to obtaining accurate,
high-resolution classification of image
regions in response to BCS/FCS-processed SAR
data [Grossberg and Williamson, 1997;
Williamson, 1996, 1997].

As  with  other  ART  algorithms,  GAM  incremen¬ 
tally  self-organizes  an  internal  categorization  of 
its  inputs,  and  maps  those  categories  to  output 
class  predictions.  GAM  differs  from  other  ART 
networks  by  using  internal  category  nodes  that 
have  Gaussian  receptive  fields.  That  is,  each 
GAM  category  defines  a  Gaussian  distribution 
in  the  input  space,  with  a  mean  and  variance 
in  each  input  dimension,  as  well  as  an  overall 
a  priori  probability.  GAM’s  activation  function 
evaluates  the  posterior  probability  that  the  in¬ 
put  belongs  to  a  category,  while  GAM’s  match 
function  evaluates  how  well  the  input  fits  the 
category’s  distribution.  Match  is  a  measure  of 
the  distance,  in  units  of  standard  deviation,  be¬ 
tween  the  input  vector  and  the  category’s  mean. 
The  network’s  vigilance  parameter  specifies  the 
maximum  allowable  size  of  this  distance. 
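
The roles of activation and match described above can be made concrete with a small sketch. It is written under stated assumptions and is not the published GAM code: the class and function names are hypothetical, the prior is taken to be a category's share of the training counts, and vigilance is expressed directly as a bound on the normalized distance (`max_distance`), following the description in the text rather than the exact published equations.

```python
import math

class GaussianCategory:
    """A category with a per-dimension mean and standard deviation, plus a
    count of the training samples credited to it (used for the prior)."""
    def __init__(self, mean, sigma, count):
        self.mean = list(mean)
        self.sigma = list(sigma)
        self.count = count

def match_distance(category, x):
    """Distance from x to the category mean, in units of standard deviation."""
    return math.sqrt(sum(((xi - m) / s) ** 2
                         for xi, m, s in zip(x, category.mean, category.sigma)))

def likelihood(category, x):
    """Axis-aligned Gaussian density of x under the category."""
    norm = math.prod(s * math.sqrt(2.0 * math.pi) for s in category.sigma)
    return math.exp(-0.5 * match_distance(category, x) ** 2) / norm

def posteriors(categories, x):
    """Posterior probability of each category given x (prior * likelihood,
    normalized over all categories)."""
    total = sum(c.count for c in categories)
    joint = [(c.count / total) * likelihood(c, x) for c in categories]
    z = sum(joint)
    return [j / z for j in joint]

def passes_vigilance(category, x, max_distance):
    """True if x lies within the allowed normalized distance of the mean."""
    return match_distance(category, x) <= max_distance
```

A category can therefore have a high posterior (it is the best available explanation) and still fail the match test (the input lies too far from its mean), which is what allows the network to decide that a new category is needed.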

The  original  GAM  network  used  choice  learn¬ 
ing,  in  which  only  the  maximally  activated  cate¬ 
gory  would  learn  [Williamson,  1996].  GAM  now 
uses  distributed  learning,  in  which  each  cate¬ 
gory  is  assigned  credit  based  on  its  proportion  of 
the  net  category  activation  [Williamson,  1997]. 
The  net  activation  is  determined  by  all  cate¬ 
gories  that  exceed  vigilance  and  that  belong  to 
the  correct  ensemble.  An  ensemble  consists  of 
the  set  of  categories  that  map  to  the  same  out¬ 
put  prediction.  When  an  input  is  presented,  the 
maximally  activated  ensemble  is  chosen.  If  its 
prediction  is  correct,  the  ensemble’s  categories 
learn.  If  its  prediction  is  incorrect,  then  match 
tracking  takes  place:  the  categories  in  the  cho¬ 
sen  ensemble  are  reset,  or  deactivated,  and  vig¬ 
ilance  is  raised.  This  triggers  a  search  which 
continues  until  a  correct  or  new  ensemble  is  cho¬ 
sen. 
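
The ensemble choice and credit assignment described above can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, and summing category activations to score an ensemble is an assumption, since the text says only that the maximally activated ensemble is chosen.

```python
from collections import defaultdict

def choose_ensemble(activations, labels, eligible):
    """Return the output label whose eligible categories have the largest
    summed activation, or None if no category is eligible.

    eligible[j] is True when category j exceeds vigilance and has not
    been reset during the current search.
    """
    totals = defaultdict(float)
    for act, label, ok in zip(activations, labels, eligible):
        if ok:
            totals[label] += act
    if not totals:
        return None
    return max(totals, key=totals.get)

def learning_credit(activations, labels, eligible, correct_label):
    """Share of credit for each category: its proportion of the net
    activation of eligible categories in the correct ensemble."""
    net = sum(a for a, l, ok in zip(activations, labels, eligible)
              if ok and l == correct_label)
    if net == 0.0:
        return [0.0] * len(activations)
    return [a / net if ok and l == correct_label else 0.0
            for a, l, ok in zip(activations, labels, eligible)]
```

If the chosen ensemble's label is wrong, a caller would mark that ensemble's categories as ineligible, tighten vigilance, and call `choose_ensemble` again. A return value of None then signals that a new category should be created.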

Distributed  GAM  uses  fewer  categories  and  ob¬ 
tains  higher  classification  rates  than  choice¬ 
learning  GAM.  We  have  also  shown  that  dis¬ 
tributed  GAM  is  a  constructive,  incremental 
variant  of  an  Expectation  Maximization,  or 
EM, algorithm for mixture modeling and
classification [Dempster, Laird, and Rubin, 1977;
Ghahramani  and  Jordan,  1994].  This  batch 
EM  algorithm  learns  a  Gaussian/multinomial 
mixture  model,  with  a  locally  maximized  like¬ 
lihood,  of  the  joint  input/output  space.  EM 
then  generates  output  predictions  based  on  this 
model.  We  have  compared  GAM  and  EM  on 
three  real-world  classification  databases:  let¬ 
ter  image  recognition,  Landsat  satellite  image 
segmentation,  and  speaker  independent  vowel 
recognition.  GAM  outperforms  EM,  as  well  as 
a  wide  range  of  neural  network  and  machine 
learning  classifiers,  on  all  three  problems.  Ta¬ 
ble  1  summarizes  the  best  results  obtained  by 
GAM  and  EM. 

In  addition  to  its  higher  accuracy,  GAM  also 
possesses  the  practical  advantages  of  construc¬ 
tive,  incremental  learning.  For  example,  obtain¬ 
ing  the  best  results  on  all  three  problems  listed 
in  Table  1  requires  a  wide  range  in  the  number 
of  internal  categories.  With  EM,  a  user  needs 
to  choose  this  number  through  trial  and  error 
for  each  problem.  GAM,  on  the  other  hand,  au¬ 
tomatically  constructs  the  appropriate  number 
of  categories  for  each  problem.  Because  GAM 
learns  incrementally,  it  can  also  be  applied  to 
situations  that  require  immediate  updating  of 
input/output mappings as data are received,
including the learning of arbitrary new
contingencies, such as the addition of new
class labels.

3.2  ARTEX Incremental
Classification of Natural and
SAR Textured Scenes

Another research accomplishment is the
development of ARTEX, a self-organizing
system that learns to recognize textured
scenes [Grossberg and Williamson, 1997]. The
ARTEX architecture, depicted in Figure 4,
uses a simplified BCS/FCS network whose
17-dimensional output vectors are fed into
the GAM network, which incrementally learns
class predictions. ARTEX replaces the
feedback loops of the BCS by a one-shot,
fast, feedforward approximation that is 75
times faster than the full BCS. This
approximation loses some of the photorealism
achieved by the full BCS/FCS algorithm for
purposes of human use, but achieves faster
machine classification on natural and SAR
textures without a loss of accuracy. In
particular, ARTEX uses part of the BCS to
extract context-sensitive texture features.
This Static Oriented Contrast, or SOC, filter
is computed at four orientations and four
spatial scales (Stages 1-5 in Figure 4). One
of these scales is used to generate a surface
representation (at Stage 9 in Figure 4) by
gating diffusion of signals from the
center-surround network (at Stage 1) that
discounts the illuminant. The 16-dimensional
texture feature and 1-dimensional surface
brightness feature form the input vector to
GAM. The OV and OI options in Figure 4 are
orientationally variant and invariant
pathways to the classifier, respectively,
which are used in different classes of
applications. We have exhaustively
benchmarked variations of these input
vectors.

Figure 4: ARTEX model. See text for details.

Comparison of GAM and EM

Classification             GAM                      EM
Benchmark         Accuracy  # Categories   Accuracy  # Categories
Letter Images      94.6%        941         92.6%       1,000
Landsat Images     90.0%        255         89.4%         275
Spoken Vowels      56.0%         54         54.6%          40

Table 1: Results obtained by GAM and EM on three real-world classification problems.

We have also compared ARTEX to a recent
state-of-the-art algorithm called the Hybrid
System, which is a hybrid architecture that
uses a log-Gabor pyramid for feature
extraction followed by one of three
alternative classifiers [Greenspan et al.,
1994; Greenspan, 1996]. Table 2 shows the
results of ARTEX and the Hybrid System on
comparable benchmarks of high-resolution
classification of 30 natural textures. ARTEX
outperforms all variations of the Hybrid
System, which include a rule-based classifier
(ITRULE), a multilayer perceptron (MLP), and
a K-nearest neighbor classifier (K-NN). We
have expanded the benchmarks to include
classification of up to 42 textures, and
found that ARTEX scales up gracefully, with
only a slight decrease in accuracy and a
slight increase in memory per texture, as the
number of textures increases. ARTEX maintains
well over 90% accuracy even on the 42-texture
benchmark.

GAM  shares  the  advantages  of  the  three  clas¬ 
sifiers  used  in  the  Hybrid  System  without  their 
serious  drawbacks.  Like  ITRULE,  GAM  pre¬ 
dicts  the  posterior  probabilities  of  the  output 
classes.  However,  GAM  uses  a  simple,  in¬ 
cremental  learning  procedure,  as  opposed  to 
ITRULE’s  complex  and  computationally  expen¬ 
sive  batch  learning  procedure.  Like  K-NN, 
GAM  learns  local  mappings  quickly.  However, 
GAM  also  achieves  significant  data  compres¬ 
sion,  unlike  K-NN.  MLP  also  achieves  signifi¬ 
cant  data  compression,  but  it  learns  very  slowly, 
requiring  100  times  more  training  epochs  on  the 
30-texture  benchmark.  In  addition,  MLP  uses  a 
form  of  mismatch  learning  that  is  susceptible  to 
“catastrophic  forgetting”  if  it  is  trained  on  new 
data  with  different  contingencies  from  previous 
data. 

4  FACADE  Model  of  3-D  Vision 
and Figure-Ground Separation

One of the most challenging problems in
vision concerns how the brain automatically
separates objects from each other and from
their backgrounds in response to static 2-D
pictures and 3-D scenes. Such preprocessing
is especially important for IU of partially
occluded objects. Recognition
is  far  easier  in  response  to  preprocessed  images 
that  separate  occluding  from  occluded  objects 
and  complete  the  boundary  and  surface  rep¬ 
resentations  of  the  latter.  Such  completion  is 
often  said  to  be  amodal  because  it  influences 
recognition  of  a  partially  occluded  object  with¬ 
out  causing  a  corresponding  visible  percept. 

We  have  been  progressively  developing  a  com¬ 
putational  model  of  how  the  visual  cortex 
achieves  this  competence,  and  have  used  it 
to explain many psychophysical and
neurobiological data about 3-D vision [Grossberg,
1994,  1997;  Grossberg  and  McLoughlin,  1997; 
McLoughlin  and  Grossberg,  1997].  This  is 
called  a  FACADE  model  because  it  generates  a 
representation  of  Form-And-Color-And-DEpth. 
The  FACADE  model  generalizes  BCS  bound¬ 
ary  segmentations  and  FCS  surface  representa¬ 
tions  to  the  case  of  3-D  vision.  The  model  inter¬ 
prets  data  from  the  parallel  interblob  (bound¬ 
ary)  and  blob  (surface)  processing  streams  that 
begin in the LGN and pass through cortical
areas V1, V2, and V4. Output vectors from the
FACADE  model  to  a  pattern  classifier  repre¬ 
sent  completed  representations  of  overlapping 
occluding  and  occluded  objects  lying  on  separa¬ 
ble  boundary  and  surface  representations  that 
represent  different  depths  from  an  observer. 

Comparison of ARTEX and Hybrid System

Configuration           Accuracy  # Samples/Class  # Epochs  # Categories  # Weights
Hybrid System, ITRULE    80.0%         300          Batch         -             -
Hybrid System, MLP       89.6%         300           500          60         2,700
Hybrid System, K-NN      82.0%         300             1       9,000       144,000
ARTEX                    92.3%         300             5         214         7,697
ARTEX                    94.4%         768             2         374        13,478

Table 2: Results of ARTEX and the Hybrid System on comparable 30-texture benchmarks.

Figure 5 shows a macrocircuit of this
architecture. In this macrocircuit, left eye
(or camera) and right eye (or camera)
monocular preprocessing stages (MPL and MPR)
in the LGN send parallel pathways to the BCS
(boxes with vertical lines, designating
oriented responses of multiple scales) and
the FCS (boxes with three pairs of circles,
designating opponent colors). The monocular
signals BCS$_L$ and BCS$_R$ activate oriented
simple cells which, in turn, activate
bottom-up pathways, labeled 1, to generate a
multiple-scale binocular boundary
segmentation. The binocular segmentation
generates output signals to the monocular
filling-in domains, or FIDOs, of the FCS via
pathways labeled 2. This interaction captures
monocular FCS signals that are consistent
with the binocular BCS boundaries and lifts
them into depthful surface representations.
All other FCS signals are suppressed.
Reciprocal FCS-to-BCS interactions enhance
the boundaries that successfully created
filled-in surfaces and suppress boundaries
corresponding to more distant surfaces. This
feedback loop achieves boundary-surface
consistency. In so doing, it realizes an
asymmetry between the processing of near and
far objects that helps to explain how
occluding objects pop out. The surviving FCS
signals activate the binocular FIDOs via
pathways 3, where they interact with an
augmented binocular BCS segmentation to fill
in a multiple-scale FACADE representation,
which represents visible percepts of 3-D
surfaces. This representation is compared
with data about cortical area V4. The model
explains how these functional properties
emerge automatically from network
interactions.

Figure 5: A FACADE macrocircuit. See text
for details.

Recent projects have been computationally
developing the model using psychophysical and
neurobiological data. As with our other
biologically-based vision projects, this will
continue until the architecture is mature
enough to be transferred to military and
other high technology applications. As
illustrated by the cortical BCS model above,
technology transfer can occur very rapidly
with these models as soon as they are
completely characterized computationally. The
latest projects have explained data about:
(1) How human observers achieve binocular
fusion in response to ambiguous binocular
stimuli, such as Panum’s limiting case,
dichoptic masking stimuli, and half-occluded
images in which only one eye may detect part
of a scene due to its 3-D layout (da Vinci
stereopsis) [Grossberg and McLoughlin, 1997;
McLoughlin and Grossberg, 1997]. Other
computational vision models have failed to
explain how humans assign the correct depth
to half-occluded objects that are seen with
only one eye. This project has further
developed the model’s multiple-scale
binocular filter. The filter clarifies how
humans match spatially disparate image
features from the two eyes that have the same
contrast polarity, yet pool signals from both
polarities in order to build effective
boundary segmentations.

Figure 6: Relative contrasts influence
pop-out and completion of partially occluded
figures.

(2) How human observers see 2-D textures pop
out into 3-D representations that greatly
influence the discriminability of different
texture regions [Grossberg and Pessoa, 1997].
Of particular interest is the asymmetric
effect of background luminance on whether
colored texture elements are discriminable,
an important issue for the design of
effective displays that contain multiple
sources of information. This project has led
to a refinement of the model’s feedback loops
between the FCS and the BCS (pathways 2 in
Figure 5). These pathways ensure that 3-D
boundary and surface representations are
computationally consistent, even though they
are computed by circuits that obey
complementary processing rules.

(3) How changes in the geometry of an image
and in the relative contrasts of its regions
interact through the cooperative and
competitive processes that determine how the
image will be parsed by the brain into
separate objects [Grossberg, 1997]. This
project has clarified how the brain uses the
image T-junctions at which occluding objects
intersect occluded objects to separate them
from one another. The model shows how this
can be done without assuming that the brain
contains explicit T-junction detectors.
Rather, the contextual balance of boundary
cooperation and competition strengthens some
boundaries while breaking others. In
particular, the boundary of the stem of the T
gets broken from its top as an early step in
figure-ground pop-out. This explanation
clarifies why models that have depended upon
T-junction detectors to explain figure-ground
pop-out have fallen into difficulty when
confronted with many scenes.


Figure 6 is a famous demonstration of the
interaction between geometry and contrast
that is due to Bregman [1981] and Kanizsa
[1979]. This demonstration shows that the
existence of a fixed set of T-junctions in an
image does not determine how it will be
parsed. Instead, reversing the relative
contrasts of occluding and occluded objects,
without changing their T-junctions, can
greatly influence how well occluded objects
are completed and recognized. Figure 6B shows
the limiting case in which the occluder
contrast is less than that of the occluded
gray fragments, and indeed equals the
luminance of the background. This greatly
reduces recognition of the same gray
fragments that are easily grouped and
recognized as letters B when they have a
smaller contrast than the black occluder in
Figure 6A.

5  ART and ARTMAP Neural
Networks for Applications:
Self-Organizing Learning,
Recognition, and Prediction

ART  and  ARTMAP  neural  networks  have  been 
applied  to  a  variety  of  problems,  illustrated  here 
by  satellite  remote  sensing,  radar  identification, 
and  medical  database  examples.  A  new  family 
of distributed ART models retains stable
coding, recognition, and prediction while allowing
arbitrarily  distributed  category  representation 
during  learning  and  performance. 

5.1  Technology  Transfer:  ART  and 
ARTMAP  Neural  Networks 

Researchers  at  the  Boston  University  Depart¬ 
ment  of  Cognitive  and  Neural  Systems/Center 
for  Adaptive  Systems  (CNS/CAS),  supported 
by  basic  research  funding  from  DARPA  and 
ONR,  have  introduced  and  analyzed  the  ART 
(Adaptive  Resonance  Theory)  family  of  neural 
network  architectures  for  self-organizing  cate¬ 
gory  learning,  recognition,  and  prediction.  Ca¬ 
pabilities  of  these  systems  include  stable  in¬ 
cremental  learning,  if-then  rule  extraction,  and 
database  interpretation.  In  analyzing  large  non¬ 
stationary  input  streams,  ART  systems  real¬ 
ize  a  combination  of  properties  that  are  not 
shared by other neural network, artificial
intelligence, or statistical methods. This
research program is now advancing
state-of-the-art engineering, moving from
neural network models to application
prototypes and fielded systems. ART networks
are being used for airplane design and
manufacturing, adaptive system software,
circuit design, speech recognition, financial
prediction, medical imaging, database
analysis, robotics, and defense intelligence.
This technology transfer has been accelerated
by consultations between development
engineers and Boston University faculty
researchers, as well as by the contributions
of DARPA- and ONR-funded Ph.D. students and
postdoctoral fellows and by CNS graduates who
are implementing ART systems as part of their
commercial and government employment.
Examples of applications that have been
published include: a Boeing parts design
retrieval system [Caudell, Smith, Escobedo,
and Anderson, 1994]; an autonomous vision
system [Caudell and Healy, 1994]; robot
sensory-motor control [Bachelder and Waxman,
1994; Baloch and Waxman, 1991; Bachelder,
Waxman, and Seibert, 1993; Dubrawski and
Crowley, 1994a]; robot navigation [Dubrawski
and Crowley, 1994b; Racz and Dubrawski,
1995]; active vision [Srinivasa and Sharma,
1996]; 3-D object recognition [Seibert and
Waxman, 1992]; face recognition [Seibert and
Waxman, 1993]; medical imaging [Soliz and
Donohoe, 1996]; satellite remote sensing
[Baraldi and Parmiggiani, 1995; Gopal,
Sklarew, and Lambin, 1993]; Macintosh
operating system software [Johnson, 1993];
automatic target recognition [Bernardon and
Carrick, 1995; Koch, Moya, Hostetler, and
Fogler, 1995; Rubin, 1995; Waxman et al.,
1995]; electrocardiogram classification [Ham
and Han, 1996; Suzuki, 1995]; air quality
monitoring [Wienke, Xie, and Hopke, 1994];
weather prediction [Soliz and Caudell, 1996];
strength prediction for concrete mixes
[Kasperkiewicz, Racz, and Dubrawski, 1995];
signature verification [Murshed, Bortolozzi,
and Sabourin, 1995]; decision making and
intelligent agents [Ruda and Snorrason,
1996]; document retrieval [MacLeod and
Surkan, 1992; Varma, Woods, and Agogino,
1996]; analysis of musical scores
[Gjerdingen, 1990]; character classification
[Gan and Lua, 1992; Kim, Jung, Kim, and Kim,
1995; Wang, Xu, and Ziaoliang, 1992]; machine
condition monitoring and failure forecasting
[Choi, Ly, Healy, and Smith, 1996; Ly and
Choi, 1994; Tarng, Li, and Chen, 1994; Tse
and Wang, 1996]; chemical analysis from UV
(ultraviolet) and IR (infrared) spectra
[Wienke, 1993, 1994]; multi-sensor chemical
analysis [Whiteley, Davis, Mehrotra, and
Ahalt, 1996]; combinatorial optimization
[Burke, 1994]; detection of cancerous cells
[Murshed, Bortolozzi, and Sabourin, 1996];
sorting of recycled materials [Wienke and
Kateman, 1994]; frequency selective surface
design for electromagnetic system devices
[Christodoulou, Huang, Georgiopoulos, and
Liou, 1995]; and digital circuit design
[Kalkunte, Kumar, and Patnaik, 1992].

Ongoing  basic  research  at  Boston  University 
continues  to  expand  capabilities  of  this  class  of 
systems  while  collaborative  projects  help  trans¬ 
fer  the  technology.  Some  of  this  work  will  now 
be  outlined. 

5.2  ART  and  ARTMAP  Neural 
Networks:  Introduction 

Adaptive  resonance  theory  originated  from  an 
analysis  of  human  cognitive  information  pro¬ 
cessing  and  stable  coding  in  a  complex  input 
environment  [Grossberg,  1976,  1980].  An  evolv¬ 
ing  series  of  ART  neural  network  models  have 
added  new  principles  to  the  early  theory  and 
have  realized  these  principles  as  quantitative 
systems  that  can  be  applied  to  problems  of 
category  learning,  recognition,  and  prediction. 
Each  ART  network  forms  stable  recognition  cat¬ 
egories  in  response  to  arbitrary  input  sequences 
with  either  fast  or  slow  learning  regimes.  The 
first  ART  model,  ART  1  [Carpenter  and  Gross¬ 
berg,  1987a],  was  an  unsupervised  learning  sys¬ 
tem  to  categorize  binary  input  patterns.  ART 
2  [Carpenter  and  Grossberg,  1987b]  and  fuzzy 
ART  [Carpenter,  Grossberg,  and  Rosen,  1991] 
extend  the  ART  1  domain  to  categorize  ana¬ 
log  as  well  as  binary  input  patterns  [Carpen¬ 
ter  and  Grossberg,  1991].  Supervised  ART 
architectures, called ARTMAP systems,
self-organize arbitrary mappings from input vectors,
representing  features  such  as  geospatial  spec¬ 
tral  values  and  terrain  variables,  to  output  vec¬ 
tors,  representing  predictions  such  as  vegeta¬ 
tion  classes  or  environmental  variables.  Inter¬ 
nal  ARTMAP  control  mechanisms  create  stable 
recognition  categories  of  optimal  size  by  maxi¬ 
mizing  code  compression  while  minimizing  pre¬ 
dictive  error  in  an  on-line  setting.  Binary  ART 
1  computations  are  the  foundation  of  the  first 
ARTMAP  network  [Carpenter,  Grossberg,  and 
Reynolds,  1991].  When  fuzzy  ART  replaces 
ART  1  in  an  ARTMAP  system,  the  resulting 
fuzzy  ARTMAP  architecture  [Carpenter,  Gross¬ 
berg,  Markuzon,  Reynolds,  and  Rosen,  1992] 
rapidly  learns  stable  mappings  between  analog 
or  binary  input  and  output  vectors. 
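
As a concrete illustration of the unsupervised building block mentioned above, the sketch below implements the standard fuzzy ART operations with complement coding: the choice function |I ∧ w| / (α + |w|), the match criterion |I ∧ w| / |I| ≥ ρ, and the learning rule w ← β (I ∧ w) + (1 − β) w. It is a minimal sketch rather than the publicly distributed algorithm: the class name, default parameter values, and list-based representation are assumptions made for brevity.

```python
def complement_code(a):
    """Append 1 - a_i for every component; inputs must lie in [0, 1]."""
    return list(a) + [1.0 - v for v in a]

def fuzzy_and(x, w):
    """Component-wise minimum (the fuzzy AND operator)."""
    return [min(xi, wi) for xi, wi in zip(x, w)]

class FuzzyART:
    def __init__(self, vigilance=0.75, alpha=0.001, beta=1.0):
        self.rho = vigilance   # match threshold in [0, 1]
        self.alpha = alpha     # choice parameter, small and positive
        self.beta = beta       # learning rate; 1.0 is fast learning
        self.weights = []      # one weight vector per committed category

    def _choice(self, I, w):
        return sum(fuzzy_and(I, w)) / (self.alpha + sum(w))

    def train(self, a):
        """Present one input and return the index of the category that learned."""
        I = complement_code(a)
        # Search committed categories in order of decreasing choice value.
        order = sorted(range(len(self.weights)),
                       key=lambda j: -self._choice(I, self.weights[j]))
        for j in order:
            overlap = fuzzy_and(I, self.weights[j])
            if sum(overlap) / sum(I) >= self.rho:      # resonance
                self.weights[j] = [self.beta * o + (1.0 - self.beta) * w
                                   for o, w in zip(overlap, self.weights[j])]
                return j
            # Otherwise the category is reset and the search continues.
        self.weights.append(I)                          # commit a new category
        return len(self.weights) - 1
```

With fast learning, a category's weight vector shrinks monotonically toward the component-wise minimum of the complement-coded inputs it has accepted, which is why previously learned categories remain stable as new inputs arrive.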

5.3  Match-Based  Learning  versus 
Error-Based  Learning 

A  match-based  learning  process  is  the  basis  of 
ART  stability.  Match-based  learning  allows 
memories  to  change  only  when  attended  por¬ 
tions  of  the  external  world  match  internal  ex¬ 
pectations,  or  when  something  completely  new 
occurs.  When  the  external  world  fails  to  match 
an  ART  network’s  expectations  or  predictions, 
a  search  process  selects  a  new  category,  rep¬ 
resenting  a  new  hypothesis  about  what  is  im¬ 
portant  in  the  present  environment.  Match- 
based  learning,  with  its  intrinsic  stability  fea¬ 
ture,  makes  ART  and  ARTMAP  well  suited  to 
problems  that  require  on-line  learning  of  a  large 
and  evolving  database.  On  the  other  hand, 
error-based  learning  is  more  naturally  suited  to 
other  classes  of  problems,  such  as  the  learn¬ 
ing  of  sensory-motor  maps,  that  require  slow 
adaptation  to  statistical  averages  rather  than 
the  construction  of  a  complex  knowledge  sys¬ 
tem.  Error-based  learning  responds  to  a  mis¬ 
match  by  sending  the  difference  between  a  tar¬ 
get  output  and  an  actual  output  toward  zero, 
rather than by initiating a search for a
better match. Neural networks that employ
error-based learning include multi-layer
perceptrons [Rosenblatt, 1958, 1962] such as
back propagation [Rumelhart, Hinton, and
Williams, 1986; Werbos, 1974]. ART and
ARTMAP networks
feature  winner-take-all  (WTA)  competitive  cod¬ 
ing,  which  groups  inputs  into  disjoint  recogni¬ 
tion  categories.  Other  neural  network  learning 
systems  such  as  back  propagation  feature  dis¬ 
tributed  coding,  which  can  provide  good  noise 
tolerance  and  code  compression  but  which  typi¬ 
cally  requires  slow  learning.  Fast  learning  tends 
to  cause  catastrophic  forgetting  in  these  net¬ 
works,  as  it  does  in  ART  and  ARTMAP  net¬ 
works  in  which  the  code  representation  is  dis¬ 
tributed.  On  the  other  hand,  fast  learning  is 
often  desirable  for  on-line  adaptation  to  rapidly 
changing  circumstances  and  for  encoding  of  rare 
cases  and  large  databases.  Variants  of  the  basic 
ART  and  ARTMAP  networks  can  acquire  some 
of  the  advantages  of  distributed  coding  while 
maintaining  fast  learning  capability.  For  exam¬ 
ple,  ART-EMAP  uses  WTA  codes  for  learning 
and  distributed  codes  for  testing.  Distributed 
prediction  can  significantly  improve  ARTMAP 
performance,  especially  when  the  size  of  the 
training  set  is  small  [Carpenter  and  Ross,  1993, 
1995;  Rubin,  1995].  In  medical  database  pre¬ 
diction  problems,  which  often  feature  incon¬ 
sistent  training  input  predictions,  ARTMAP- 
IC  (instance  counting)  [Carpenter  and  Marku¬ 
zon,  1996]  improves  performance  with  a  com¬ 
bination  of  distributed  prediction,  category  in¬ 
stance  counting,  and  a  new  match  tracking 
search  algorithm.  A  voting  strategy  further  im¬ 
proves  prediction  by  training  the  system  sev¬ 
eral  times  on  different  orderings  of  an  input 
set.  Voting,  instance  counting,  and  distributed 
representations  combine  to  form  confidence  es¬ 
timates  for  competing  predictions.  However, 
since  these  and  most  other  ART  and  ARTMAP 
variants  use  WTA  coding  during  learning,  they 
do  not  solve  problems  such  as  category  prolifer¬ 
ation  with  noisy  training  sets,  unless  learning  is 
slow. A new class of ART and ARTMAP networks
permits fast distributed learning as well as
performance. These distributed ART (dART)
and distributed ARTMAP (dARTMAP) systems
[Carpenter, 1997] are now being analyzed
and  developed  for  future  applications.  The  fol¬ 
lowing  sections  describe  these  developments. 

5.4  ARTMAP Architecture

ARTMAP networks self-organize mappings from
input vectors, representing features such as
patient history and test results, to output
vectors, representing predictions such as the
likelihood of an adverse outcome following a
procedure. Fuzzy ARTMAP incorporates two
fuzzy ART modules, ART$_a$ and ART$_b$, that
are linked by a map field $F^{ab}$. Many
applications of supervised learning systems
such as ARTMAP are classification problems,
where the trained system tries to predict a
correct category given a test set input
vector. A prediction might be a single
category or distributed as a set of scores or
probabilities. A fuzzy ARTMAP algorithm,
publicly available from the CNS department,
outlines a procedure for these problems,
which do not require the full architecture.
The algorithm implements a fuzzy ARTMAP
network that is a simplified version of the
full network but that nevertheless is
sufficient for most current applications
(Figure 7).

During supervised learning, ART$_a$ receives a
stream of patterns $\{a^{(n)}\}$ and ART$_b$
receives a stream of patterns $\{b^{(n)}\}$,
where $b^{(n)}$ is the correct prediction given
$a^{(n)}$. An associative learning network and an
internal controller link these modules to make
the ARTMAP system operate in real time. The
controller creates the minimal number of
ART$_a$ recognition categories, or
“hidden  units”,  needed  to  meet  accuracy  crite¬ 
ria.  A  minimax  learning  rule  enables  ARTMAP 
to  learn  quickly,  efficiently,  and  accurately  as  it 
conjointly  minimizes  predictive  error  and  maxi¬ 
mizes  code  compression.  This  scheme  automat¬ 
ically  links  predictive  success  to  category  size  on 
a  trial-by-trial  basis  using  only  local  operations. 
It works by increasing the ART$_a$ vigilance
parameter $\rho_a$ by the minimal amount needed
to correct a predictive error at ART$_b$.

At the map field, an ARTMAP network forms
associations between categories via outstar
learning and triggers search, via a match
tracking rule, when a training set input fails
to make a correct prediction. Match tracking
increases the ART$_a$ vigilance parameter
$\rho_a$ in response to a predictive error at
ART$_b$. A baseline vigilance parameter
$\bar{\rho}_a$ calibrates a minimum confidence
level at which ART$_a$ will accept a chosen
category. Lower values of $\bar{\rho}_a$ allow
larger categories to form, maximizing code
compression. Initially, $\rho_a = \bar{\rho}_a$.
During training, a predictive failure at ART$_b$
increases $\rho_a$ just enough to trigger an
ART$_a$ search. Match tracking sacrifices the
minimum amount of compression necessary to
correct the predictive error. Hypothesis testing
selects a new ART$_a$ category, which focuses
attention on a cluster of input features that
is better able to predict $b^{(n)}$. With fast
learning, match tracking allows a single ARTMAP
system to learn a different prediction for a
rare event than for a cloud of similar frequent
events in which it is embedded.
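
The match tracking loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the publicly available CNS algorithm: the class name `SimpleFuzzyARTMAP`, the choice parameter `alpha`, the complement coding step, and the shortcut of committing a new node as soon as every existing category fails are all choices made here for brevity. The output side is reduced to a class label per category, so ART$_b$ and the map field are not modeled explicitly.

```python
import numpy as np

def complement_code(a):
    """Map a in [0,1]^M to (a, 1-a). Standard fuzzy ART preprocessing (assumed here)."""
    a = np.asarray(a, dtype=float)
    return np.concatenate([a, 1.0 - a])

class SimpleFuzzyARTMAP:
    """Hypothetical, simplified fuzzy ARTMAP with fast learning and MT+ match tracking."""

    def __init__(self, rho_bar=0.0, alpha=0.001, eps=0.0001):
        self.rho_bar = rho_bar  # baseline vigilance
        self.alpha = alpha      # choice parameter (assumed value)
        self.eps = eps          # match tracking increment (MT+ uses eps > 0)
        self.w = []             # one weight vector per committed category
        self.label = []         # class predicted by each category

    def _choice(self, A):
        # Fuzzy ART choice function T_j = |A ^ w_j| / (alpha + |w_j|)
        return [np.minimum(A, w).sum() / (self.alpha + w.sum()) for w in self.w]

    def train_one(self, a, label):
        A = complement_code(a)
        rho = self.rho_bar                     # vigilance starts at baseline
        T = self._choice(A)
        for j in np.argsort(T)[::-1]:          # search categories by decreasing T_j
            match = np.minimum(A, self.w[j]).sum() / A.sum()
            if match < rho:
                continue                       # fails vigilance: reset, keep searching
            if self.label[j] == label:
                self.w[j] = np.minimum(A, self.w[j])  # fast learning
                return j
            rho = match + self.eps             # predictive error: raise vigilance
        self.w.append(A.copy())                # no category fits: commit a new one
        self.label.append(label)
        return len(self.w) - 1

    def predict(self, a):
        T = self._choice(complement_code(a))
        return self.label[int(np.argmax(T))]
```

Training this sketch on three points from two well-separated classes and then on one "rare event" at (0.15, 0.15) with the opposite label creates a third, small category for the exception, while nearby points such as (0.1, 0.1) keep the prediction of the surrounding cloud.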

5.5  Geospatial  Mapping  from 
Satellite  Remote  Sensing  Data 

Mapping  vegetation  from  satellite  remote  sens¬ 
ing  data  has  been  an  active  area  of  research  and 
development over a twenty-year period [Hoffer
et al., 1975; Strahler, Logan, and Bryant, 1978].
Recently,  a  new  ARTMAP-based  methodology 
for  automatic  mapping  from  Landsat  Thematic 
Mapper  (TM)  and  terrain  data  was  developed 
to  solve  challenging  remote  sensing  classifica¬ 
tion  problems,  using  a  prototype  data  set  that 
predicted  vegetation  classification  from  spectral 
and  terrain  features  [Carpenter,  Gjaja,  Gopal, 
and  Woodcock,  1997].  After  training  at  the 
pixel  level,  system  capabilities  were  tested  at 
the  stand  level  in  regions  not  seen  during  train¬ 
ing.  ARTMAP  learning,  being  fast,  stable,  and 
scalable,  overcame  common  limitations  of  back 
propagation,  which  did  not  give  satisfactory 
performance  on  this  problem.  Best  results  were 
obtained  using  a  hybrid  system  based  on  a  con¬ 
vex  combination  of  fuzzy  ARTMAP  and  maxi¬ 
mum  likelihood  predictions.  A  voting  strategy 
improved  prediction  by  training  the  system  sev¬ 
eral  times  on  different  orderings  of  an  input  set. 
Voting  also  assigns  confidence  estimates  to  com¬ 
peting  predictions. 
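
The two combination steps mentioned above, voting across independently trained networks and a convex combination of two predictors, can be sketched as follows. The function names, the mixing weight `lam`, and the use of the winning vote fraction as a confidence estimate are assumptions made for illustration; the source does not give the exact formulas used in the mapping study.

```python
from collections import Counter

def vote(predictions):
    """Majority vote over class labels from several voters.

    Each voter is assumed to be the same network trained on a different
    ordering of the input set. Returns (winning label, fraction of votes),
    where the fraction serves as a simple confidence estimate.
    """
    counts = Counter(predictions)
    winner, n = counts.most_common(1)[0]
    return winner, n / len(predictions)

def convex_combination(p_artmap, p_ml, lam):
    """Mix two class-probability vectors: lam * p_artmap + (1 - lam) * p_ml."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_artmap, p_ml)]
```

For example, five voters returning three "conifer" and two "hardwood" labels yield the prediction "conifer" with confidence 0.6.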

Figure 7: Fuzzy ART embedded in a simplified ARTMAP network. In the fuzzy ART algorithm,
$w_j$ denotes both the bottom-up weight vector and the top-down weight vector, with $w_{ij} = w_{ji}$.
The ARTMAP network computes classification probabilities, with $|b| = 1$ at an output field $F_0^b$.

5.6 ARTMAP-IC Applied to a
Medical Prediction Problem

Automated medical diagnosis incorporates
many of the most challenging problems that are
intrinsic to large-scale database analysis in
general. Working with these problems has
stimulated a number of ART architecture
developments in recent years. One such system,
ART-EMAP  (evidence  MAP),  improves  perfor¬ 
mance  in  noisy  or  ambiguous  input  environ¬ 
ments  by  adding  to  the  basic  ARTMAP  system 
distributed  spatial  and  temporal  evidence  accu¬ 
mulation  processes,  in  four  incremental  stages 
[Carpenter  and  Ross,  1993,  1995].  Stage  1  dis¬ 
tributes  activity  across  category  representations 
during  performance.  In  a  variety  of  studies, 
this  device  improves  test-set  predictive  accuracy 
compared  to  ARTMAP,  which  is  the  same  net¬ 
work  except  with  category  choice  during  testing 
as  well  as  training.  Distributed  test-set  cate¬ 
gory  activation  improves  performance  accuracy 
on  various  medical  database  simulations.  Fur¬ 
ther  improvement  is  achieved  by  the  ARTMAP- 
IC  neural  network  [Carpenter  and  Markuzon, 
1996],  which  adds  category  instance  counting 
and  a  new  search  algorithm  to  ART-EMAP  dis¬ 
tributed  prediction.  Instance  counting  weights 
distributed  predictions  according  to  the  number 
of  training  set  inputs  placed  in  each  category.  A 
new  version  of  the  ARTMAP  match  tracking  al¬ 
gorithm,  which  controls  search  following  a  pre¬ 
dictive  error,  facilitates  prediction  with  sparse 
or inconsistent data. Compared to the original
match tracking rule (MT+), the new algorithm
(MT-) further compresses memory without loss of
accuracy. Simulations that illustrate ARTMAP-IC
performance on the Pima Indian Diabetes (PID)
medical database are summarized below. Results
for ARTMAP-IC compare favorably to those of
logistic regression, K nearest neighbor (KNN),
and the perceptron network ADAP, and also to
those of the basic ARTMAP network and ART-EMAP.
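
Instance counting as described above can be illustrated with a short sketch. The exact normalization used by ARTMAP-IC is not given in this section, so the form below is an assumption: activity is restricted to the Q most active category nodes, each node's activation is weighted by the number of training inputs it coded, and the weighted activity is summed per output class. The function and argument names are hypothetical.

```python
import numpy as np

def q_max_instance_counting(T, counts, labels, n_classes, Q):
    """Distributed prediction over the Q most active category nodes.

    T      : activation of each category node for the test input
    counts : number of training inputs placed in each category
    labels : output class associated with each category
    Returns a vector of class probabilities that sums to 1
    (assumes at least one selected node has positive activation).
    """
    T = np.asarray(T, dtype=float)
    counts = np.asarray(counts, dtype=float)
    top = np.argsort(T)[::-1][:Q]        # indices of the Q largest activations
    weighted = T[top] * counts[top]      # instance counting weights each node
    probs = np.zeros(n_classes)
    for j, wgt in zip(top, weighted):
        probs[labels[j]] += wgt
    return probs / probs.sum()
```

With activations (0.9, 0.8, 0.3), counts (1, 10, 5), class labels (0, 1, 0), and Q = 2, the heavily used second node dominates and class 1 is predicted; with all counts set to 1 the same input would favor class 0.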

5.7  Comparative  Simulations  on  the 
Pima  Indian  Diabetes  Database 

The  PID  data  set  [Smith,  Everhart,  Dickson, 
Knowler,  and  Johannes,  1988]  was  obtained 
from  the  UCI  repository  of  machine  learn¬ 
ing  databases  [Murphy  and  Aha,  1992].  The 
database  task  is  to  predict  whether  a  patient 
will  develop  diabetes,  based  on  eight  clinical 
findings: age, the diabetes pedigree function,
body  mass,  2-hour  serum  insulin,  triceps  skin 
fold  thickness,  diastolic  blood  pressure,  plasma 
glucose  concentration,  and  number  of  pregnan¬ 
cies.  Each  patient  represented  in  the  database 
is  a  female  of  Pima  Indian  heritage  who  is  at 
least  21  years  old.  Smith  et  al.  used  the  PID 
data  set  to  evaluate  the  ADAPtive  learning  rou¬ 
tine  (ADAP)  [Smith,  1962],  a  type  of  percep¬ 
tron  [Rosenblatt,  1958,  1962].  This  study  had 
576  cases  in  the  training  set  and  192  cases  in 
the  test  set,  and  comparative  simulations  keep 
the same training and test sets. About 34.9%
of patients in the sample developed diabetes.

Model                          Correct       C-index   Compression
                               predictions             factor
logistic regression            77%           0.84      --
ADAP                           76%           --        --
ARTMAP (K=1)                   66%           0.76      9.3
  [MT+: ε=+0.0001]

Model                          Q = 15   12<Q<19   Peak %   [C-index, Q]      Compression
KNN                            77%      76-77%    77%      [0.80, Q=13-15]   1
ART-EMAP                       76%      76-78%    78%      [0.87, Q=13]      9.3
  [MT+: ε=+0.0001]
ARTMAP-IC                      79%      79-80%    80%      [0.87, Q=9-13]    9.3
  [MT+: ε=+0.0001]

                               Q = 15   13<Q<17
ARTMAP-IC                      81%      80-81%    81%      [0.88, Q=15]      9.3
  [MT-: ε=-0.0001]

                               Q = 11   8<Q<14
ARTMAP-IC                      79%      78-81%    81%      [0.87, Q=9]       12.8
  [MT-: ε=-0.01]

Table 3: Pima Indian diabetes (PID) simulations.

Table 3 compares ADAP test set performance
with that of logistic regression, KNN, and three
ARTMAP networks. ARTMAP-IC uses the instance
counting rule and a Q-max rule, for distributed
prediction using the Q category nodes that
receive maximal input. Comparative simulations
show results for ART-EMAP (Stage 1), which is
equivalent to ARTMAP-IC without instance
counting; and for basic ARTMAP, which sets
Q = 1 for category choice during testing.
On  average,  the  various  ARTMAP  networks, 
which  share  a  common  training  regime,  have  62 
committed  category  nodes.  With  two  output 
classes,  an  a  priori  rule-of-thumb  estimate  for 
the  size  of  distributed  category  representation 
sets  Q  =  15.  Table  3  shows  that  ARTMAP- 
IC  has  the  best  test  set  performance,  both  in 
terms of the C-index and the number of correct
test set predictions. MT- with parameter
$\epsilon = -0.01$ compresses memory even more,
reducing the number of committed nodes from 62
to 45, with little deterioration in predictive
accuracy. Compared to KNN, the ARTMAP networks
compress memory by a factor of about 10:1.
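
The compression factors quoted for the PID simulations are consistent with a simple ratio of training cases to committed category nodes, with KNN storing every training case and so having a factor of 1. That reading is an assumption, since the section does not define the factor explicitly, but the arithmetic below reproduces the reported values.

```python
def compression_factor(n_training_cases, n_committed_nodes):
    """Training cases stored per committed category node (assumed definition)."""
    return n_training_cases / n_committed_nodes

# 576 training cases, 62 committed nodes on average
baseline = round(compression_factor(576, 62), 1)   # 9.3
# MT- with epsilon = -0.01 reduces the network to 45 nodes
compressed = round(compression_factor(576, 45), 1)  # 12.8
```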

5.8  Radar  Target  Recognition 

Radar  range  profiles  are  one-dimensional  im¬ 
ages  of  radar  targets  that  reflect  the  finite  travel 
time  of  an  electromagnetic  pulse  across  an  ob¬ 
ject  [Borden,  1993;  Hudson  and  Psaltis,  1993; 
Smith  and  Goggans,  1992].  For  recognition  of 
simulated  range  profiles,  fuzzy  ARTMAP  has 
been  shown  to  achieve  accuracy  comparable 
to  KNN  classifiers  but  with  significantly  bet¬ 
ter  code  compression.  For  automatic  target 
recognition  (ATR)  from  radar  range  profiles, 
ARTMAP-based  methods  have  been  developed 
for  improved  data  compression  and  familiarity 
discrimination,  as  follows. 

5.9 Reset Buffering for Code
Compression 

Efficient  information  storage  is  critical  for  ap¬ 
plications  such  as  missile-borne  ATR  where 
space,  power,  and  processing  speed  restric¬ 
tions  are  severe.  A  new  fuzzy  ARTMAP  re¬ 
set  buffering  technique  [Grossberg,  Rubin,  and 
Streilein,  1996]  promises  even  greater  efficiency 
in  data  storage.  During  fast  learning,  the 
basic  ARTMAP  algorithm  does  not  compute 
the  cumulative  predictive  success  of  the  cate¬ 
gories  that  it  learns.  Slow  learning  and  other 
mechanisms  can  incorporate  some  statistical 
factors  into  the  trained  network  [Bradski  and 
Grossberg,  1995;  Carpenter,  Grossberg,  and 
Reynolds,  1995].  These  techniques  reduce  cate¬ 
gory  proliferation  in  response  to  data  sets  with 
high  noise  or  strongly  overlapping  probability 
distributions. 

ARTMAP  buffering  focuses  on  improving  code 
compression  in  response  to  high-dimensional  in¬ 
puts  with  largely  non-overlapping  underlying 
distributions,  as  these  are  the  type  of  distri¬ 
butions  occurring  in  range  profile  simulations. 
Since  it  is  the  fuzzy  ARTMAP  reset  process 
that  leads  to  the  generation  of  new  nodes,  mod¬ 
ifying  reset  to  be  sensitive  to  the  cumulative 
statistics  of  the  training  process  can  increase 
compression.  Each  time  a  node  leads  to  pre¬ 
dictive  success,  it  is  buffered  against  being  re¬ 
set.  This  operation  thus  uses  concepts  from  re¬ 
inforcement  learning  to  modulate  the  process 
of  recognition  learning  [Grossberg,  1982,  1987]. 
Several  variants  of  the  buffering  procedure  were 
tested  on  two  types  of  radar  range  profile  simu¬ 
lations.  One  type  of  simulation  involved  calcu¬ 
lation  of  single-scatter  radar  returns  from  scat¬ 
tering  centers  in  simulated  aircraft.  Simulations 
were  also  performed  using  xpatch,  a  sophisti¬ 
cated  electromagnetic-scattering  simulator  de¬ 
veloped  under  DoD  sponsorship  [Volakis,  1994]. 

Simulations  show  that  buffering  is  capable  of 
reducing  the  required  storage  down  to  its  the¬ 
oretical  minimum  with  little  or  no  loss  of  clas¬ 
sification  accuracy  (Table  4).  The  MT-  search 
algorithm  of  ARTMAP-IC  shows  similar  com¬ 
pression  results. 


5.10  Familiarity  Discrimination 

The  recognition  process  usually  involves  famil¬ 
iarity  discrimination  as  well  as  identification. 
Consider,  for  example,  a  neural  network  de¬ 
signed  to  identify  aircraft  based  on  their  radar 
reflections  and  trained  on  sample  reflections 
from  ten  types  of  aircraft.  After  training,  the 
network  should  correctly  classify  radar  reflec¬ 
tions  belonging  to  these  ten  familiar  classes,  but 
it  should  also  abstain  from  making  a  meaning¬ 
less  guess  when  presented  with  a  radar  reflection 
from  a  different,  unfamiliar  class  of  aircraft. 

ARTMAP-FD (familiarity discrimination) is an
extension of fuzzy ARTMAP that performs
familiarity discrimination. During testing, an
input pattern A is defined as familiar when a
familiarity function $\phi(A)$ is greater than a
decision threshold $\gamma$. If $\phi(A) > \gamma$,
ARTMAP-FD predicts an output class. If
$\phi(A) \leq \gamma$, A is regarded as belonging
to an unfamiliar class and the network makes no
prediction. In the case of a test set sequence,
ARTMAP-FD accumulates familiarity measures at
each predicted class as the sequence is
presented. Once the winning class is determined,
the object’s familiarity is defined as the
average accumulated familiarity measure of that
class. The receiver operating characteristic
(ROC) formalism can be used to determine the
threshold $\gamma$.
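
The decision rule above can be sketched as follows. The familiarity function itself is not specified in this section, so the sketch takes familiarity values as given inputs. Choosing the winning class by the number of observations that predicted it is an assumption, as are all function names. The last helper computes one point of an ROC curve for a candidate threshold.

```python
from collections import defaultdict

def classify_or_abstain(phi, predicted_class, gamma):
    """Return the predicted class if familiarity exceeds gamma, else None."""
    return predicted_class if phi > gamma else None

def classify_sequence(observations, gamma):
    """observations: list of (predicted_class, familiarity) pairs for one object.

    Familiarity is accumulated per predicted class. The winner is taken to be
    the class predicted most often (an assumption), and the object's familiarity
    is the average accumulated familiarity of that class.
    """
    totals = defaultdict(float)
    votes = defaultdict(int)
    for cls, phi in observations:
        totals[cls] += phi
        votes[cls] += 1
    winner = max(votes, key=votes.get)
    familiarity = totals[winner] / votes[winner]
    return classify_or_abstain(familiarity, winner, gamma)

def roc_point(phi_familiar, phi_unfamiliar, gamma):
    """(false alarm rate, hit rate) for threshold gamma."""
    hit = sum(p > gamma for p in phi_familiar) / len(phi_familiar)
    false_alarm = sum(p > gamma for p in phi_unfamiliar) / len(phi_unfamiliar)
    return false_alarm, hit
```

Sweeping `gamma` over a range of values and plotting the resulting pairs traces out an ROC curve, from which a threshold with an acceptable trade-off between hits and false alarms can be chosen.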

Figure  8A  depicts  the  scattering  centers  of  a 
set  of  36  simulated  targets.  The  network  was 
trained  on  simulated  range  profiles  generated  by 
18  randomly  chosen  targets  (in  boxes),  which 
define  the  set  of  familiar  classes.  ROC  curves 
(Figure 8B) were obtained from simulated
multiwavelength (40 center frequencies) range
profiles from all 36 targets, familiar and
unfamiliar. Sequential evidence accumulation was
performed for 1, 3, and 100 observations,
corresponding to 0.05, 0.15, and 5.0 seconds of
observation time. Classification accuracy for
familiar targets was 89.5%, 97.0%, and 100.0%,
respectively, with the network creating 44
category nodes.

                                   minimum   without buffering    with buffering
simulation          classifier     # nodes   accuracy   # nodes   accuracy   # nodes
scattering centers  fuzzy ARTMAP   4         56.3%      18        55.1%      4
xpatch (SNR=3)      fuzzy ARTMAP   3         79.7%      10        77.1%      3
scattering centers  ART-EMAP       36        100.0%     135       100.0%     36

Table 4: Buffering reduces data storage requirements for simulated radar range profile ATR with
minimal loss of classification accuracy.

Figure 8: (A) Scattering centers for simulation of radar range profiles of familiar (boxed) and
unfamiliar targets. (B) ROC curves for discrimination between familiar and unfamiliar targets
using simulated radar range profiles.

5.11 Distributed ART and
Distributed ARTMAP

Basic research continues to expand
computational capabilities of ART and ARTMAP
systems. A new class of distributed ART models
retains stable coding, recognition, and
prediction, but allows arbitrarily distributed
code representation during learning as well as
performance [Carpenter, 1997]. These networks
automatically apportion learned changes according
to the degree of activation of each coding node.
This permits fast as well as slow learning
without catastrophic forgetting. Distributed ART
models replace the traditional neural network
path weight with a dynamic weight equal to the
rectified difference between coding node
activation and an adaptive threshold. The input
signal $T_j$ that activates the distributed code
is a function of a phasic component $S_j$, which
depends on the active input, and a tonic
component $\Theta_j$, which depends on prior
learning but is independent of the current
input, as in the fuzzy ARTMAP algorithm. At each
synapse, phasic and tonic terms balance one
another and exhibit dual computational
properties. During learning with a constant
input, phasic terms are constant while tonic
terms may grow. Tonic components would then
become larger for all inputs, but phasic
components would become more selective, reducing
the total coding signal sent by a significantly
different input pattern. Dynamic weights that
project to coding nodes obey a distributed
instar learning law and those that originate
from coding nodes obey a distributed outstar
learning law. Inputs activate distributed codes
through phasic and tonic signal components with
dual computational properties, and a parallel
distributed match-reset-search process helps
stabilize memory. When the code is
winner-take-all, the unsupervised distributed
ART model (dART) is computationally equivalent
to fuzzy ART and the supervised distributed
ARTMAP model (dARTMAP) is equivalent to fuzzy
ARTMAP. With fast distributed learning, dART and
dARTMAP networks are likely to expand the
domain of applications of the ART family of
networks.
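
The dynamic weight described above, a rectified difference between coding node activation and an adaptive threshold, can be written in one line. The symbols `y` (activation of a coding node) and `tau` (thresholds on its paths) are names assumed for this sketch. The reading that a winner-take-all node with activation 1 recovers an ordinary weight equal to one minus the threshold is also an assumption, offered as one way to see why the winner-take-all case can reduce to fuzzy ART.

```python
import numpy as np

def dynamic_weight(y, tau):
    """Rectified difference [y - tau]^+ between activation and thresholds."""
    return np.maximum(y - np.asarray(tau, dtype=float), 0.0)

# Winner-take-all case: the chosen node has activation y = 1,
# so each path carries weight 1 - tau.
wta = dynamic_weight(1.0, [0.2, 0.7])

# Distributed case: a partially active node (y = 0.5) transmits only
# through paths whose threshold lies below its activation.
partial = dynamic_weight(0.5, [0.2, 0.7])
```

In the distributed case the second path is silent, which illustrates how learned changes can be apportioned according to each node's degree of activation.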

6  Coherent  Processing  of  Moving 
Targets 

6.1  Motion  Capture,  Attention,  and 
Tracking 

When  an  object  moves  under  real-world  condi¬ 
tions,  aperture  ambiguity  and  image  or  detector 
noise  often  prevent  all  but  a  small  subset  of  its 
image  features,  such  as  its  bounding  contours, 
from  generating  unambiguous  motion  direction 
cues.  This  problem  is  faced  by  all  detection  sys¬ 
tems  that  use  spatially  limited  filters,  which  see 
the  world  through  the  “aperture”  defined  by 
the filter’s spatial extent. The problem is
exacerbated when multiple targets are all
moving in different directions in a cluttered scene
under  rapidly  changing  lighting  conditions,  as 
often  happens  in  battlefield  scenarios.  Mo¬ 
tion  processing  models  which  depend  entirely 
on  bottom-up  filtering  mechanisms  break  down 
badly  in  such  situations.  A  process  of  mo¬ 
tion  capture  enables  the  brain  to  cope  with  this 
problem  even  in  response  to  cluttered,  scintil¬ 
lating  scenes.  This  project  is  developing  a  com¬ 
putational  model  of  the  motion  capture  process 
[Chey,  Grossberg,  and  Mingolla,  1997a,  1997b]. 
Such  a  model  facilitates  the  identification  and 
attentive  tracking  of  rapidly  moving  objects  un¬ 
der  noisy,  cluttered,  and  camouflaged  detection 
conditions. 

Motion  capture  is  the  process  whereby  ambigu¬ 
ous  motion  signals  that  are  distributed  across 
an  object  are  reorganized  into  a  coherent  rep¬ 
resentation  of  the  object’s  motion  direction  and 
speed.  For  example,  consider  the  task  of  rapidly 
detecting  a  leopard  leaping  from  a  jungle  branch 
under  a  sun-dappled  forest  canopy.  Consider 
how  spots  on  the  leopard’s  coat  move  as  its 
limbs  and  muscles  surge.  Imagine  how  the  pat¬ 
terns  of  light  and  shade  play  on  the  leopard’s 
coat  as  it  leaps  through  the  air.  These  lumi¬ 
nance  and  color  contours  move  across  the  leop¬ 
ard’s  body  in  a  variety  of  directions  that  do  not 
necessarily point in the direction of the
leopard’s leap. Instead, the leopard’s body
generates a scintillating mosaic of moving contours
that  could  easily  prevent  its  detection.  Typi¬ 
cally,  only  the  bounding  contours  of  the  leop¬ 
ard’s  body  provide  unambiguous  cues  of  the 
leopard’s  true  direction  and  speed  of  motion. 
Motion  capture  rapidly  reorganizes  this  scintil¬ 
lating  mosaic  of  moving  light  and  shade  into  a 
coherent  object  percept  with  a  unitary  motion 
direction  and  speed.  The  leopard  as  a  whole 
then seems to pop out from the jungle
background and to draw our attention.

Motion capture seems to be a preattentive
process, in particular an automatic bottom-up
visual process. A striking property of the model is that
the  same  circuit  mechanism  which  carries  out 
motion  capture  can  also  use  top-down  attention 
to  search  for  an  object  moving  in  a  prescribed 
direction.  This  attentional  priming  circuit  au¬ 
tomatically  suppresses  motion  signals  that  are 
not  in  the  desired  direction,  while  enhancing 
and  grouping  signals  that  are.  Surprisingly,  this 
turned  out  to  be  a  type  of  Adaptive  Resonance 
Theory  (ART)  circuit.  As  such,  it  suggests  how 
the  model  can  learn  through  visual  experience 
with  moving  targets  to  group  together  signals 
that  represent  the  same  motion  direction. 

Figure  9  shows  the  model  processing  stages  in 
schematic  form.  The  model  is  called  a  Motion 
Boundary  Contour  System  (mBCS)  because  it 
is  homologous  to  the  BCS  that  is  used  to  form 
boundary  representations  of  static  forms.  The 
static  form  processing  system  is  based  upon  ori- 
entationally  tuned  computations.  The  motion 
processing  system  is  based  upon  directionally 
tuned  computations.  The  parts  of  a  complex 
object  may  be  defined  by  many  oriented  compo¬ 
nents,  even  though  the  object  as  a  whole  moves 
in  a  single  direction.  The  mBCS  model  clari¬ 
fies  how  signals  from  multiple  orientations  are 
pooled  into  a  single  direction  of  motion. 

The model was developed through our efforts
to simulate a large psychophysical and
neurobiological database about coherent motion
perception. The neurobiological data concern the
processing stream that joins cortical areas V1,
MT, and MST. This analysis led us to articulate
the following five design principles on which
the model is based: (1) Unambiguous feature
tracking signals are used to capture and
transform ambiguous motion signals into coherent
representations of object motion. (2) A single
feature tracking process can contextually select
both the target direction and speed. (3) Motion
direction and speed are a collective property
of a multiple-scale self-similar filter whose
individual scales are sensitive to different speed
ranges. The collective output from these scales
provides robust motion estimates in response to
noisy data. (4) Nearby contours of different
orientation and contrast polarity that are moving
in the same direction cooperate to generate a
pooled motion direction signal. This process is
modeled by a spatially long-range motion
filter. This long-range filter also helps to solve
another important problem; namely, it enables
the model to track spatially and temporally
discontinuous views of a target that is moving with
variable speed [Grossberg, 1997]. More
traditional tracking algorithms depend upon
having a continuous trajectory through time. In
the brain, this tracking process leads to
percepts of long-range apparent motion. (5)
Motion capture of the correct direction can occur
without distorting the object speed estimate.
This design constraint leads to the conclusion
that motion capture is achieved by a spatially
long-range grouping network of ART type. This
grouping network operates after the long-range
spatial filter. Its feedback circuit also allows
attention to prime a desired motion direction.

Figure 9: mBCS model processing stages.

Figure 10: Motion capture develops in time.
See text for details.

Figure 11: Schematic of form-motion fusion
model.

Figure  10  illustrates  motion  capture  for  the  sim¬ 
ple  case  of  a  bar  tilted  at  45  degrees  and  moving 
to  the  right.  Each  successive  frame  shows  the 
selected  motion  direction  signals  at  a  different 
time.  Each  line  is  proportional  to  the  maximal 
activity  within  that  time  frame  across  all  long- 
range  filter  scales  that  are  maximally  tuned  to 
that  line’s  direction.  Due  to  aperture  ambiguity, 
unambiguous  feature  tracking  signals  are  at  first 
computed  only  at  the  bar  ends.  The  favored 
motion  direction  in  the  bar’s  middle  is  perpen¬ 
dicular  to  its  orientation,  due  to  aperture  ambi¬ 
guity.  Motion  capture  enables  the  feature  track¬ 
ing  signals,  which  are  relatively  sparse  and  weak 
compared  to  the  ambiguous  aperture  signals,  to 
suppress all aperture signals that are not
compatible with them. In the bottom frame, the
correct direction and speed of the bar are
computed everywhere along its length.
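
The aperture ambiguity in this example can be made concrete with a small calculation. A local detector in the middle of the bar can only measure the component of the true velocity that is perpendicular to the bar's orientation. The sketch below uses standard vector projection and is not the mBCS filter itself; the function name and the unit speed are assumptions.

```python
import math

def normal_flow(velocity, orientation_deg):
    """Component of a 2-D velocity perpendicular to a contour.

    velocity        : (vx, vy) true motion of the contour
    orientation_deg : contour orientation, in degrees from the x-axis
    Returns the velocity that a purely local detector would report.
    """
    theta = math.radians(orientation_deg)
    n = (-math.sin(theta), math.cos(theta))      # unit normal to the contour
    s = velocity[0] * n[0] + velocity[1] * n[1]  # signed speed along the normal
    return (s * n[0], s * n[1])

# Bar tilted at 45 degrees, moving to the right at unit speed.
local = normal_flow((1.0, 0.0), 45.0)
local_speed = math.hypot(*local)
```

Here `local` is approximately (0.5, -0.5): the mid-bar signal points diagonally, perpendicular to the bar, with a speed of about 0.71 rather than 1. Only the bar ends, where a feature can be tracked, signal the true rightward motion, which is why motion capture must propagate those signals along the bar.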

The  model  is  able  to  track  both  first-order  and 
second-order  motion  stimuli  [Baloch,  Gross- 
berg,  Mingolla,  and  Nogueira,  1997].  A  first- 
order  stimulus  is  a  stimulus  whose  motion  can 
be  discriminated  by  spatially  tracking  a  differ¬ 
ence  of  mean  luminance  or  color  over  time.  In 
second-order  motion  stimuli,  there  is  no  differ¬ 
ence  in  luminance  and  color  between  moving  re¬ 
gions,  but  the  spatial,  temporal,  or  ocular  distri¬ 
bution  of  mean  luminance  or  color  may  change 
through  time.  Illustrative  second-order  stim¬ 
uli  include  scintillating  objects  whose  mean  lu¬ 
minance  and  color  do  not  differ  from  that  of 
their  background.  Thus  the  model’s  filtering 
and  grouping  mechanisms  are  designed  to  work 
under  stimulus  conditions  that  may  include  var¬ 
ious  types  of  decoys  or  distractors. 

6.2  Fusion  of  Form  and  Motion 
Processing 

In many biological and technological
situations, an object’s representation is an
emergent property of boundary and surface
completion mechanisms within the form processing
system. This is particularly true when sensors
are used that are noisy and include many missing
pixels, as also occurs within biological retinas. Under
these  circumstances,  how  can  objects  be  tracked 
by  the  motion  processing  system,  given  that 
one  cannot  continuously  track  the  objects’  pix¬ 
els  from  one  image  frame  to  the  next?  Some 
type  of  form-motion  fusion  is  needed  here  to 
show  how  emergent  representations  of  objects 
within  the  orientationally-based  form  process¬ 
ing  stream  interact  with  the  directionally-based 
motion  processing  stream  to  generate  track- 
able  representations  of  objects  in  motion.  We 
have  been  developing  a  computational  model 
of  how  the  brain  accomplishes  form-motion  fu¬ 
sion  [Baloch  and  Grossberg,  1997;  Francis  and 
Grossberg,  1996],  and  have  used  it  to  explain 
many data about how the brain knows how to
combine successive image frames into motion
trajectories,  even  when  one  cannot  continuously 
track  pixels  from  one  frame  to  the  next. 

A  number  of  authors  have  posited  that  such 
form-sensitive  motion  tracking  requires  pattern 
matching  and  geometry-based  parsing  of  ob¬ 
ject  representations  between  successive  image 
frames;  e.g.,  Tse,  Cavanagh,  and  Nakayama 
[1997].  We  have  shown  that  the  key  psychophys¬ 
ical  data  about  this  process  can  be  simulated 
simply  by  combining  our  models  of  boundary, 
surface,  motion,  and  attentional  processing  into 
a larger architecture. In other words, form-
motion fusion seems to be “nothing but” the
collective  action  of  the  component  models  that 
have  been  reviewed  above.  A  schematic  of  this 
architecture  is  shown  in  Figure  11. 

At  what  processing  stage  should  the  link  be¬ 
tween  form  and  motion  processing  be  made?  It 
should  occur  after  the  form  processing  stage  at 
which  emergent  binocular  boundaries  are  gen¬ 
erated  and  before  the  long-range  motion  filter 
at  which  motion  signals  are  pooled  into  motion 
directions;  see  Figure  11.  This  interaction  also 
helps  to  generate  accurate  representations  of  a 
moving  object’s  depth  in  cases  where  range  de¬ 
tectors  are  not  used.  This  is  true  because  the 
form  processing  stream  uses  precisely  oriented 
binocular  matches  to  generate  precise  represen¬ 
tations  of  a  static  form’s  depth.  The  motion 
processing system loses this capability by
pooling multiple orientations into a single
motion direction. The form-motion interaction
selectively enhances those motion computations
that are consistent with the depths that are
computed within the form system. In the visual
cortex, this larger architecture models how the
form processing stream between cortical areas
V1, V2, and V4 interacts with the motion
processing stream between cortical areas V1, MT,
and MST via a cross-stream interaction from
area V1 to MT.

7  Psychophysical  Experiments  for 
Real-World  Tasks 

7.1  Visual  Search  Experiments  in 
Cluttered  Environments 

Much  effort  has  gone  into  developing  algorithms 
for  helping  human  observers  on  the  battlefield 
to  make  more  effective  decisions.  Surprisingly 
little  work  has  been  done,  however,  to  de¬ 
termine  the  types  of  perceptual  and  cognitive 
factors  that  can  improve  the  accuracy  of  hu¬ 
man  decision  makers  under  naturalistic  condi¬ 
tions.  Many  decision  making  tasks  involve  rapid 
search  through  cluttered  and  noisy  sources  of 
visual  information.  Many  visual  search  exper¬ 
iments  published  in  the  cognitive  science  lit¬ 
erature  use  small  numbers  of  isolated  targets 
and  distractors  with  simple,  2-D  forms.  On 
the  other  hand,  much  of  the  human  factors 
research  on  more  complex  imagery  is  so  tied 
to  measures  of  performance  on  specific  image 
paradigms  that  an  elucidation  of  fundamen¬ 
tal  limits  of  human  mechanisms  is  difficult  to 
achieve. 

To  more  closely  approximate  natural  scenes 
than  is  typically  done  in  psychological  research, 
we  displayed  on  a  video  monitor  variable-sized 
rocks  in  formations  that  resembled  irregular 
stone  walls,  rendered  using  a  3D  modeling  pro¬ 
gram.  We  examined  visual  search  for  a  fore¬ 
ground  rock  resting  on  other  rocks.  Experimen¬ 
tal  factors  were  (1)  cast  shadows  (present  or  ab¬ 
sent)  and  (2)  occlusion  configuration  (amount  of 
contour  occluding  other  rocks  and  whether  or 
not  any  background  rock  was  visible  on  both 
sides of an occluding target rock). Typical
scenes contained between 20 and 160 rocks. In a
“target present” scene, one rock was closer to
the observer and in front of the others; in a
“target absent” scene, all rocks were approximately
the same distance from the observer. To ensure
soft shadows in shadow-present scenes, illumination
was modeled by a pair of point light
sources: a faint source in line with the viewing
direction, and a bright source above and
slightly to the right of the line of sight.

A  psychophysical  display  paradigm  for  para¬ 
metrically  varying  complex  variables  of  illu¬ 
mination  and  shading  in  a  visual  search  task 
has  been  developed.  Certain  published  find¬ 
ings  based  on  search  in  scenes  of  simple  poly¬ 
hedral  objects,  including  asymmetries  in  search 
speed  for  illumination  from  above  vs.  below, 
have  been  shown  not  to  generalize  to  com¬ 
plex,  smoothly-shaded  images.  Additional  re¬ 
sults  suggest  useful  ways  of  manipulating  high¬ 
light  and  shadow  information  in  rendered  im¬ 
agery  to  encourage  the  “pop-out”  of  targets  in 
clutter.  A  preliminary  report  appears  in  Cun¬ 
ningham  et  al.  [1996]. 

7.2 Experiments on the Perceived
Segregation of Element-Arrangement Textures

Recent progress in the generation of color-fused
imagery from artificial sensors, operating
in the visible-near-infrared (low-light CCD)
and thermal infrared bands (Gen III intensifier
tubes), indicates that appropriate combinations
of wavelengths used to render imagery
may make an operator’s task of detecting targets
in clutter significantly easier than it would
be when viewing gray-scale imagery [Waxman
et al., 1997]. The fusion algorithms developed
by Waxman’s group are based in part
on a foundation of modeling and algorithm development
by our own group. Development of
these models and algorithms has, in turn, been
based on, among other constraints, analysis of
human psychophysical experiments conducted by
our own group and by others. Thus, a cycle
of experimental measurement, theoretical modeling
and simulation, and development of
applications can be closed to further advance
all three endeavors. Additional psychophysical
work, described next, is being targeted to refine
our understanding of human mechanisms for
segmentation of chromatic images, which will
contribute to the development of still more sophisticated
algorithms and hardware for color-fused
imagery.

The  fundamental  visual  mechanisms  involved  in 
such  visual  tasks  are  as  yet  poorly  understood, 
however,  and  quantitative  experiments  in  chro¬ 
matic  texture  segregation  ought  to  pay  direct 
dividends  in  helping  to  select  parameters  for 
rendering  of  color-fused  imagery. 

Many  experiments  have  used  achromatic 
element-arrangement  patterns  to  explore  the 
mechanisms  involved  in  texture  segregation. 
Element-arrangement  patterns  are  composed  of 
two  types  of  elements  arranged  in  alternating 
vertical  stripes  in  the  top  and  bottom  regions 
and  in  a  checkerboard  pattern  in  the  middle  re¬ 
gion.  Only  recently  has  work  been  done  with 
chromatic  element-arrangement  patterns  and 
early  results  suggest  that  different  mechanisms 
are  involved.  We  undertook  a  series  of  experi¬ 
ments  to  determine  the  factors  that  are  primar¬ 
ily  responsible  for  the  perceived  segregation  of 
chromatic  element-arrangement  patterns.  We 
investigated  the  effects  of  hue,  spatial  scale,  and 
background  luminance  on  the  segregation  of  the 
element-arrangement  patterns.  Hue  similarity, 
as  rated  by  subjects  in  a  separate  procedure, 
was  a  relatively  weak  factor  for  predicting  per¬ 
ceived  segregation.  The  effects  of  brightness 
differences  and  luminance  differences  interacted 
with  background  luminance  and  spatial  scale. 
Perceived  segregation  was  stronger  with  a  black 
background  than  with  a  white  background  and 
stronger  for  higher  spatial  frequencies. 

An  equation  based  on  differences  in  color- 
opponent  channels  has  been  developed  that  ac¬ 
curately  predicts  the  perceived  segregation  of 
chromatic  element-arrangement  patterns.  A 
theory that accounts for the background luminance
affecting chromatic and achromatic patterns
differently has been developed and supported
by experimental findings. The experimental
results suggest that perceived segregation
of chromatic element-arrangement patterns
is largely a function of cone contrast, and
perceived segregation of achromatic element-arrangement
patterns is largely a function of the
contrast ratio of the squares.

Figure 12: (A) Model of saccadic learning
and movement control. (B) Anatomical correlates:
SC=superior colliculus; SNr=substantia
nigra; PPRF=paramedian pontine reticular formation;
MRF=mesencephalic reticular formation.

8  Adaptive  Multimodal  Fusion  of 
Eye  Movement  Commands 

This  project  investigated  how  multiple  sources 
of  information  that  are  computed  in  different 
coordinate  frames  can  learn  in  real-time  to  fuse 
their  information  into  a  common  movement 
map.  All  the  movement  constraints  can  then 
compete  for  attention  on  the  map  and  select  a 
movement  target.  Such  an  algorithm  is  useful 
in  tracking  systems  wherein  changes  in  sensor 
properties  due  to  use  in  the  field  may  require 
a  self-calibrated  adjustment  of  decision-making 
parameters in order to maintain accurate target
tracking. The neurobiological insights about
how  the  brain  allocates  attention  to  movement 
targets  may  also  help  to  design  more  effective 
displays  for  pilots  and  other  users  of  visually 
formatted  information. 

The  project  modeled  how  this  type  of  adap¬ 
tive  fusion  controls  the  rapid  ballistic  move¬ 
ments  that  our  eyes  make  when  we  look  around 
the  world  [Grossberg,  Roberts,  Aguilar,  and 
Bullock,  1997].  These  ballistic  movements  are 
called  saccades.  We  modeled  how  the  sac¬ 
cadic  movement  system  selects  a  target  when 
visual,  auditory,  and  planned  movement  com¬ 
mands  differ.  In  particular,  visual  signals  are 
computed  in  retinal,  or  camera-centered,  coor¬ 
dinates,  whereas  auditory  signals  are  computed 
in  head-centered  coordinates.  Much  evidence 
suggests  that  saccadic  commands  are  computed 
in  motor  error  coordinates.  How  do  all  these 
coordinate  systems  get  consistently  calibrated 
through  learning,  and  how  do  they  interact  to 
select  a  movement  command  from  all  the  tar¬ 
gets  that  may  be  available  at  any  time? 

Recent  neurobiological  data  suggest  that  this 
sort  of  multimodal  data  fusion  takes  place  in  the 
deep  layers  of  the  superior  colliculus  (SC).  The 
model suggests how auditory and planned saccadic
target positions become aligned and compete
with visually detected target positions to
select a movement command (Figure 12). For
this  to  occur,  visual  targets  are  transformed 
from  retinotopic  to  motor  error  coordinates,  and 
a  transformation  between  auditory  and  planned 
head-centered  representations  and  this  motor 
error  representation  is  learned.  The  model  sim¬ 
ulates  recent  neurophysiological  data  recorded 
from  identified  cells  within  the  deep  layers  of  cat 
and  monkey  SC.  These  cells  are  of  great  func¬ 
tional  interest  because  one  cell  type  (the  peak 
decay,  or  burst  cell)  is  predicted  to  be  a  source  of 
movement  and  learning  signals  before  the  mul¬ 
timodal  map  gets  learned.  The  other  cell  type 
(the  traveling  wave,  or  buildup  cell)  is  proposed 
to  receive  the  peak  decay  teaching  signals,  as 
well  as  the  multimodal  auditory  and  planned 
movement  signals  with  which  they  are  associ¬ 
ated  in  the  movement  map.  In  fact,  NMDA 
receptors,  which  are  known  to  be  involved  in 
brain learning, have recently been discovered in
this part of the SC. After the map is learned,
both cell layers use the same circuit that controls
the learning process to focus attention and select
a movement target. The model also functionally
interprets the saccade-related behaviors
of cells in many of the areas that interact with
SC in making a movement decision, including
the frontal eye fields, parietal cortex, mesencephalic
reticular formation, paramedian pontine
reticular formation, and substantia nigra
pars reticulata (see Figure 12).

9 Head Mounted Space-Variant Active Vision
System: Algorithms and Hardware

The  goal  of  this  project  is  to  construct  a  minia¬ 
ture  space-variant  active  vision  system  which  is 
to  be  demonstrated  as  a  prosthetic  device  for 
the blind. Related applications of head mounted
miniature active vision systems, for example
those using infra-red sensors or night vision,
are also expected to benefit from this technology.
The benchmark tasks for the computer vision
system are a variety of pattern recognition,
navigation, and obstacle avoidance tasks. The
man-machine interface will be provided by
virtual (i.e., spatially defined) audio cues
and other auditory stimuli.

The motivation for this project is to harness
the large potential reduction in computational
complexity provided by the architecture of the
higher vertebrate and human visual systems,
which are in all cases strongly space-variant.
Previous work has shown that it is possible to
achieve a reduction of two to four orders of
magnitude in space complexity in unconstrained
wide-field machine vision applications via the
use of space-variant architectures such as the
log-polar image format.
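
The size reduction can be illustrated with a minimal sketch of log-polar
resampling. It assumes a plain log-polar grid centered on the image with
nearest-neighbour sampling; the function name, the grid sizes (32 rings,
64 wedges), and the 512 x 512 test image are illustrative choices, not
parameters taken from the system described here.

```python
import numpy as np

def logpolar_sample(image, n_rings=32, n_wedges=64, r_min=1.0):
    """Resample a grayscale image onto a log-polar grid (nearest neighbour).

    Hypothetical helper: ring radii grow exponentially from r_min to the
    largest radius that fits in the image, so sampling density falls off
    with distance from the centre.
    """
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r_max = min(cy, cx)
    radii = r_min * (r_max / r_min) ** (np.arange(n_rings) / (n_rings - 1))
    angles = 2.0 * np.pi * np.arange(n_wedges) / n_wedges
    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    ys = np.clip(np.rint(cy + rr * np.sin(aa)).astype(int), 0, h - 1)
    xs = np.clip(np.rint(cx + rr * np.cos(aa)).astype(int), 0, w - 1)
    return image[ys, xs]

image = np.random.default_rng(0).random((512, 512))
logmap = logpolar_sample(image)
reduction = image.size / logmap.size   # ratio of pixel counts
print(logmap.shape, reduction)
```

With these illustrative sizes the log-polar image holds 128 times fewer
pixels than the uniform image while still covering the full field of view;
wider fields or coarser grids push the ratio higher.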

Space-variant  active  vision  systems  are  ideal  as 
an  architectural  basis  for  the  construction  of 
miniature  computer  vision  systems,  but  a  num¬ 
ber  of  difficult  problems  are  associated  with  ex¬ 
ploiting  this  feature  of  biological  vision.  These 
include the need to build miniature camera systems
[Engel et al., 1994], develop control algorithms
[Greve, Engel, and Schwartz, 1997], and,
most significantly, develop the basic image
processing algorithms that are applicable in this
domain.

In the first year of this project, our goal was to
construct a hardware system suitable for prototyping
the algorithms for the project, and to begin
design of the algorithms. This initial hardware
system has been built from “off-the-shelf”
parts, and provides a much larger degree of computational
support than is envisioned for the
final embedded version of this system. This extra
functionality (large hard disk, Pentium PC
development environment, large memory, etc.)
is intended to facilitate the early development
stages of the project. The platform is based on
a quad SHARC digital signal processing system,
providing 160 MFLOPS on four
Analog Devices 2106x processors, hosted by a
Pentium PC, and deploying a miniature space-variant
active vision system.

The principal results of the past year have been
two significant advances in basic algorithms for
space-variant active vision. One of the most
serious obstacles to the development of computer
vision systems based on space-variant active vision
has been the lack of a general image processing
toolkit for the difficult image format
provided by the log-polar (and related) space-variant
sensor formats. During the past year, we
have completed development of an “exponential
chirp algorithm” which provides the equivalent
of the 2-D fast Fourier transform (FFT), but which
works on the log-polar image format. The exponential
chirp provides a form of “quasi-shift
invariance”, and we have demonstrated its successful
use in pattern matching tasks. Significantly,
we have developed a fast
exponential chirp algorithm which has the same
complexity as the conventional 2-D FFT.
Since the principal motivation for using log-polar
image formats is that they provide image
sizes which are two to four orders of magnitude
smaller than the corresponding constant resolution
image input, there is a large improvement
in processing speed. Since the exponential chirp
can be used, as is the case for the FFT, for all aspects
of image processing from early vision (e.g.,
filtering) to “late” vision (e.g., pattern recognition),
we believe that this represents a fundamental
advance in this area. We have demonstrated
a 30 frame/sec processing rate (on a 180
MHz Pentium P-6) using these methods.

A  second  major  result  has  been  the  develop¬ 
ment  of  fast  “anisotropic  diffusion”  methods 
for  image  segmentation.  Image  enhancement 
and  segmentation  methods  based  on  the  imple¬ 
mentation  of  partial  differential  equation  based 
“nonlinear”  diffusion  methods  provide  signifi¬ 
cantly  better  edge  enhancement  and  segmenta¬ 
tion  than  simple  Laplacian  or  isotropic  based 
diffusion  techniques.  The  problem  with  exploit¬ 
ing  these  superior  results  in  machine  vision  sys¬ 
tems  has  been  the  extremely  large  computa¬ 
tional  load  created  by  solving  nonlinear  partial 
differential  equations  (PDEs)  on  each  image. 
We  have  found,  for  example,  that  on  a  Pentium 
P-6,  it  can  take  up  to  2  minutes  to  process  a 
single  image  frame  with  an  anisotropic  diffusion 
method.  By  combining  our  “fast”  methods  for 
anisotropic  diffusion  with  space-variant  image 
architecture,  we  have  achieved  30  frame/sec  real 
time  processing  rates  which  are  indistinguish¬ 
able  in  quality  from  the  conventional  nonlinear 
diffusion  results.  Our  original  work  on  nonlin¬ 
ear  diffusion  is  described  in  a  recent  series  of  pa¬ 
pers  [Fischl,  1996;  Fischl,  Cohen,  and  Schwartz, 
1997a,  1997b;  Fischl  and  Schwartz,  1996,  1997a, 
1997b,  1997c],  while  the  exponential  chirp  algo¬ 
rithm  is  described  in  Bonmassar  and  Schwartz 
[1996a,  1996b,  1996c].  In  this  section,  we  will 
briefly  outline  some  of  these  results. 

9.1 Real Time Adaptive Alternatives to
Nonlinear Diffusion in Image Enhancement

Many  early  vision  systems  employ  some  type 
of  filtering  in  order  to  reduce  noise  and/or  en¬ 
hance  contrast  in  regions  which  correspond  to 
borders  between  different  objects  within  an  im¬ 
age.  The  logical  extreme  of  this  process  is  the 
creation  of  a  piecewise  constant  image  with  step 
discontinuities  at  region  boundaries.  This  goal 
is  unattainable  using  linear  filtering  techniques, 
as  noise  reduction  blurs  the  locations  of  bound¬ 
aries  between  regions,  sometimes  to  the  point  of 
fusing them. In order to address this problem,
Perona and Malik [1990] introduced a nonlinear
version  of  the  diffusion  equation  previously  used 
by  Koenderink  and  Hummel  [Hummel,  1986; 
Koenderink,  1984]  for  early  visual  processing. 
In  this  formulation,  image  intensity  is  treated  as 
a  conserved  quantity  and  allowed  to  diffuse  over 
time,  with  the  amount  of  diffusion  at  a  point 
being  inversely  related  to  the  magnitude  of  the 
intensity  gradient  at  that  location.  This  process 
produces  visually  impressive  results  in  terms  of 
the  creation  of  sharp  boundaries  separating  uni¬ 
form  regions  within  an  image,  but  is  computa¬ 
tionally  expensive  (see  ter  Haar  Romeny  [1994] 
or  Fischl  and  Schwartz  [1997a]  for  a  more  com¬ 
plete  discussion  of  these  issues).  Linear  diffu¬ 
sion  is  identified  with  Gaussian  filtering  because 
the  Gaussian  is  the  Green’s  Function  of  the  lin¬ 
ear  diffusion  equation  for  an  infinite  domain. 
Thus,  spatial  integration  with  Gaussian  kernels 
can  be  used  to  implement  linear  diffusion.  The 
nonlinear  anisotropic  diffusion  proposed  by  Per¬ 
ona  and  Malik  has  no  known  closed  form  solu¬ 
tion  analogous  to  the  Green’s  Function  solution 
of  the  linear  equation,  and  therefore  must  be 
integrated  numerically. 


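$$ I_t = \nabla \cdot \left( c(|\nabla I|)\, \nabla I \right) \tag{1} $$

where $I$ is the intensity image, $c$ is a diffusion
coefficient, $I_t$ is the partial derivative of $I$ with
respect to time, and $\nabla$ is the gradient operator
with respect to the spatial coordinates. When $c$ is
constant, equation (1) reduces to the linear diffusion
equation $I_t = c\,\Delta I$, where $\Delta$ is the Laplacian
operator with respect to the spatial coordinates. The
solution in this linear case can be written in terms
of the Green’s function¹ of the system as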

$$ I(x,y,t) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I(x',y',0)\, G(x,x',y,y',t)\, dx'\, dy' \tag{2} $$

where $I(x',y',0)$ is the initial image, and the
Green’s function $G(x,x',y,y',t)$ is a Gaussian
kernel given by

$$ G(x,x',y,y',t) = \frac{1}{4\pi c t}\, e^{-\frac{(x-x')^2 + (y-y')^2}{4ct}} \tag{3} $$


The Green’s function $G(x,x',y,y',t)$ is the kernel
of the integral operator which is the inverse
of the diffusion operator. Thus, convolution
with larger scale Gaussian kernels is equivalent
to the evolution of the diffusion equation on an
infinite domain for longer periods of time, with
the original image as initial conditions:

$$ I(x,y,t) = \iint_{D} I(x',y',0)\, G_t(x,x',y,y',\nabla I)\, dx'\, dy' \tag{4} $$

where $D$ is the image domain and we subscript
the kernel function $G_t(x,x',y,y',\nabla I)$ with $t$ to
emphasize the fact that different kernel functions
exist for different evolution times.

¹ More accurately, the Green’s function is the Gaussian
multiplied by a temporal step function [Barton,
1989].
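
The equivalence between linear diffusion and Gaussian filtering can be
checked numerically. The following is a minimal sketch, assuming a constant
diffusion coefficient, a simple explicit finite-difference scheme with
periodic borders, and a unit impulse as the initial image; the function
names, grid size, and step sizes are illustrative assumptions. After time
$t$ the diffused impulse should approximate the Gaussian of equation (3),
whose variance is $2ct$ per axis.

```python
import numpy as np

def linear_diffusion(image, c=1.0, dt=0.2, n_steps=50):
    """Explicit integration of I_t = c * Laplacian(I) with periodic borders.

    The scheme is stable for c * dt <= 0.25 on a unit-spaced 2-D grid.
    """
    out = image.astype(float).copy()
    for _ in range(n_steps):
        lap = (np.roll(out, 1, 0) + np.roll(out, -1, 0)
               + np.roll(out, 1, 1) + np.roll(out, -1, 1) - 4.0 * out)
        out += c * dt * lap
    return out

def gaussian_kernel(size, c, t):
    """Sampled Gaussian exp(-r^2 / (4 c t)), normalised to unit sum."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax, indexing="ij")
    g = np.exp(-(xx ** 2 + yy ** 2) / (4.0 * c * t))
    return g / g.sum()

size, c, dt, n_steps = 65, 1.0, 0.2, 50
impulse = np.zeros((size, size))
impulse[size // 2, size // 2] = 1.0

diffused = linear_diffusion(impulse, c, dt, n_steps)   # numerical integration
green = gaussian_kernel(size, c, dt * n_steps)         # closed-form kernel
rel_err = np.abs(diffused - green).max() / green.max()
print(rel_err)
```

The two results agree closely, with the residual difference coming from the
discretization. No comparable shortcut exists once the coefficient depends
on the local gradient, which is why the nonlinear equation has to be
integrated step by step.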

9.2  Anisotropic  Diffusion  and 
Nonlinear  Filtering 

The  nonlinear  diffusion  equation  yields  impres¬ 
sive  image  enhancement  by  selectively  averag¬ 
ing  intensity  values  from  one  side  of  an  edge  or 
the other, but not both. This can be seen by
tracking the path through which intensity values
diffuse as the equation is integrated, then
viewing  them  as  kernels.  The  spatial  integra¬ 
tion  of  the  resulting  kernels  with  the  initial  im¬ 
age  will  exactly  mirror  the  numerical  integration 
of  the  diffusion  PDE.  Carrying  this  procedure 
out  effectively  lets  one  view  the  nonlinear  fil¬ 
ter  enacted  by  the  integration  of  the  diffusion 
equation. 

In  order  to  build  the  equivalent  filter  we  must 
first  specify  a  numerical  implementation  of  (1). 
We  use  a  simple  scheme  derived  in  Fischl  and 
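Schwartz [1997a]. Given the initial image at
time $t_0$, the image at time $t_0 + \Delta t$ can be generated
by correlating the initial image with a set
of space and time varying masks:

$$ I(x,y,t_0+\Delta t) \approx \sum_{x'} \sum_{y'} m_{x,y,t_0}(x',y')\, I(x+x',\, y+y',\, t_0) \tag{5} $$

where the mask weights are given by

$$ m_{x,y,t_0} = \begin{bmatrix} 0 & c^{0}(t_0) & 0 \\ c^{1}(t_0) & 1 - \sum_{i=0}^{3} c^{i}(t_0) & c^{2}(t_0) \\ 0 & c^{3}(t_0) & 0 \end{bmatrix} \tag{6} $$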

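In order to construct the diffusion kernels $C_{\mathbf{x}}(t)$
which parallelize the diffusion, we proceed inductively.
For each point $\mathbf{x}$ in the image, we
create a kernel $C_{\mathbf{x}}$ and initialize it using a Kronecker
delta function

$$ C_{\mathbf{x}}(\mathbf{x}', 0) = \delta_{\mathbf{x}\mathbf{x}'} = \begin{cases} 1 & \text{if } \mathbf{x}' = \mathbf{x} \\ 0 & \text{otherwise} \end{cases} \tag{7} $$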




Figure  13:  Original  image  at  left  shown  to¬ 
gether  with  the  diffusion  kernels  for  the  four  in¬ 
dicated  locations.  The  image  at  the  right  was 
generated  by  spatially  integrating  the  full  set  of 
kernels  with  the  initial  image. 

=  (8) 

i 

The  results  of  carrying  out  this  procedure  on  the 
mean  curvature-based  diffusion  process  of  El- 
Fallah  and  Ford  [1994]  are  shown  in  Figure  13. 
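
The kernel-tracking idea can be sketched in a few lines. This is an
illustration under stated assumptions, not the scheme of Fischl and
Schwartz [1997a]: it uses an exponential, Perona-Malik style conductance
rather than the mean curvature process shown in Figure 13, folds the step
size into the conductance, and writes each explicit step as a dense matrix.
The names `step_matrix`, `lam`, and `kappa` and the 12 x 12 test image are
hypothetical.

```python
import numpy as np

def step_matrix(img, lam=0.2, kappa=0.1):
    """Matrix M such that vec(I(t + dt)) = M @ vec(I(t)) for one explicit step.

    Each row holds one pixel's 3x3 mask: neighbour conductances off the
    diagonal and one minus their sum on the diagonal.
    """
    h, w = img.shape
    idx = np.arange(h * w).reshape(h, w)
    M = np.eye(h * w)
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        for y in range(h):
            for x in range(w):
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    grad = img[yy, xx] - img[y, x]
                    cond = lam * np.exp(-(grad / kappa) ** 2)
                    M[idx[y, x], idx[yy, xx]] += cond
                    M[idx[y, x], idx[y, x]] -= cond
    return M

rng = np.random.default_rng(0)
img0 = np.zeros((12, 12))
img0[:, 6:] = 1.0                                  # vertical step edge
img0 += 0.05 * rng.standard_normal(img0.shape)     # additive noise

img = img0.copy()
kernels = np.eye(img0.size)      # Kronecker delta: one kernel (row) per pixel
for _ in range(20):
    M = step_matrix(img)
    img = (M @ img.ravel()).reshape(img.shape)     # ordinary integration
    kernels = M @ kernels                          # accumulate the kernels

# Integrating the final kernels against the initial image replays the diffusion.
replayed = (kernels @ img0.ravel()).reshape(img0.shape)

# Kernel of a pixel just left of the edge: weight drawn from its own side.
k = kernels[6 * 12 + 5].reshape(12, 12)
same_side = k[:, :6].sum()
print(np.allclose(replayed, img), same_side)
```

The kernel of a pixel adjacent to the edge draws essentially all of its
weight from its own side, which is the selective averaging described above.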

9.3  Results 

A  summary  of  the  results  obtained  by  these  ap¬ 
proaches  is  shown  in  Figure  14.  On  the  left  is 
shown  the  original  figure,  with  a  Canny  edge  op¬ 
erator  applied  to  the  bottom  left  frame.  (The 
choice  of  the  edge  operator  is  simply  to  al¬ 
low  a  visualization  of  the  quality  of  the  edge 
enhancement).  The  full  anisotropic  diffusion 
method  produces  the  benchmark  figure  shown 
second  from  the  left,  with  a  processing  time 
of  roughly  two  minutes  on  a  Pentium  P-6  (180 
MHz).  Third  from  the  left  is  shown  our  Green’s 
Function  Approximator  [Fischl  and  Schwartz, 
1997a], with a processing time of roughly 10 seconds
on the same processor (i.e., a roughly 10-fold increase
in speed). Finally, in the last frame is shown the
offset filter approach to nonlinear diffusion [Fischl
and Schwartz, 1997b], with a roughly 50-fold speed
increase relative to the anisotropic diffusion, and
a processing time of roughly 2 seconds on the
same processor.

In summary, we have achieved a 50-fold increase
in processing speed, while retaining the favorable
image appearance of nonlinear diffusion. The
final step towards real-time performance is then
provided by applying these methods to a space-variant
image frame, yielding a full 30 frame/sec
processing rate [Fischl, Cohen, and Schwartz,
1997b].

Figure 14: From left to right: original,
anisotropic diffusion, GFA, offset median.

The  final  section  of  this  summary  will  briefly 
review  the  exponential  chirp  algorithm.  This 
work  is  described  in  detail  in  Bonmassar  and 
Schwartz  [1996a,  1996b,  1996c]. 

Figure  15  shows  an  example  of  an  image  (the  eye 
of  Lenna),  in  log-polar  image  format.  On  the 
left  is  shown  the  original  image  pair  (i.e.,  the  im¬ 
age,  on  the  bottom,  and  its  log-polar  transform 
on  top).  On  the  right  is  shown  the  reconstruc¬ 
tion  of  these  image  pairs  obtained  by  applying 
the  exponential  chirp  transform,  and  then  ap¬ 
plying  the  inverse  exponential  chirp  transform. 
The forward, followed by inverse, exponential
chirp transform is expected to produce the identity,
if the difficult space-variant image sampling
issues are correctly solved. This is the case, as the
original log-polar and retinal images are virtually
indistinguishable from the final result after
passage through the forward and inverse exponential
chirp.

Figure 15: Exponential chirp reconstruction.
Panels: original logmap image; original image in
uniform coordinates; reconstruction of the image
logmap; reconstructed image in uniform coordinates.

Figure  16  demonstrates  the  chief  image  pro¬ 
cessing  problem  in  space-variant  vision,  which 
motivated  the  development  of  the  exponential 
chirp transform. On the top left is shown a centrally
fixated numeral “5”, and on the left are shown
three shifted versions of the “5”. As expected
from  the  properties  of  space-variant  sampling, 
the  bottom  left  frame  shows  the  degradation 
in  resolution  due  to  shifting  the  object  away 
from  the  center  of  gaze.  On  top  is  shown  the 
corresponding  log-polar  representations  of  these 
image  frames.  On  the  right,  it  can  be  seen 
that  the  log-polar  representation  of  the  shift¬ 
ing  “5”  in  the  image  plane  changes  both  in  size 
and  shape.  This  aspect  of  space-variant  imag¬ 
ing  is  one  of  the  chief  algorithmic  problems  in 
the  field,  since  even  simple  pattern  matching 
operations  are  significantly  complicated  by  the 
space-variant  geometry  of  the  sensor. 

Figure  17  shows  the  application  of  a  correlation 
operator,  based  on  the  application  of  the  expo¬ 
nential  chirp  algorithm.  It  can  be  seen  that  the 
target  (“5”  in  this  case)  is  matched  with  good 
signal  to  noise  at  all  positions  in  the  log-polar 
plane.  This  matching  operation  is  performed, 
via  the  “fast”  exponential  chirp  algorithm,  at 
30  frames/second  on  a  Pentium  P-6  processor 
[Bonmassar  and  Schwartz,  1996c]. 

In summary, by combining fast anisotropic nonlinear
diffusion methods with the fast exponential
chirp transform, it is possible to achieve
frame rate processing (30 frames/second) on
current generation (e.g., Pentium P-6) processors.
The demonstration of these algorithms,
implemented on a miniature space-variant active
vision system, will take place within the
current year.

Figure 16: Pattern matching with exponential
chirp. Panel title: Logmap of an image.

10  Robotic  Navigation  under  Visual 
Guidance 

This  project  applies  neural  network  models  to 
the  tasks  of  visual  recognition  and  navigation 
using  mobile  robots.  The  long-term  goal  is  to 
utilize  mobile  robots  as  platforms  with  which 
to test and refine neural network models developed
in the Department of Cognitive and Neural
Systems. The work emphasizes self-organizing,
unsupervised  architectures  that  can  ultimately 
afford  greater  robustness  and  flexibility  than 
traditional  approaches.  This  section  describes 
some  preliminary  results. 

10.1  Extracting  Depth  Information 
without  Camera  Calibration 

As  a  first  step,  P.  Gaudiano  and  E.  Sahin  have 
developed  a  method  for  real-time  object  local¬ 
ization  from  monocular  camera  motion,  and 
tested  it  using  a  mobile  robot  equipped  with  an 
on-board  camera  and  image  processing  system. 
The  goal  is  to  integrate  this  method  with  an 
unsupervised  neural  controller  that  learns  the 
forward and inverse odometry of a differential-drive
mobile robot [Gaudiano, Zalama, and Lopez-Coronado,
1996]. The unsupervised controller
needs  to  obtain  information  about  the  position 
of  targets  in  the  environment:  the  algorithm  de¬ 
scribed  below  relies  on  movement  and  does  not 
require  camera  calibration. 

Localization  of  an  object  within  an  egocentric 
coordinate  frame  is  computed  using  the  loom¬ 
ing  effect,  which  relates  the  change  in  the  size  of 
an  object  to  the  change  in  the  camera  position 
when  the  camera  is  moving  perpendicular  to  the 
object’s  plane.  The  looming  method  is  usually 
preferable  to  optical  flow  methods  because  of  its 
robustness,  computational  simplicity  and  inde¬ 
pendence  of  camera  calibration. 

The  angle  between  the  object  and  the  robot’s 
direction  is  also  easily  computed,  in  this  case 
using  the  horizontal  position  of  the  object  on 
the  image.  Although  the  quantitative  compu¬ 
tation  of  this  angle  requires  camera  calibration, 
the  uncalibrated  information  is  useful  for  the 
unsupervised  neural  controller. 

The  looming  method  has  been  implemented 
on  the  Pioneer  1  (Figure  18A)  by  Real  World 
Interface,  a  two-wheel  differential-drive  mobile 
robot,  equipped  with  a  color  CCD  camera  and  a 
Cognachrome  2000  color  vision  system  by  New¬ 
ton  Research  Labs.  This  vision  system  can  iden¬ 
tify  and  track  “blobs”  of  user-selected  color  at  a 
30Hz  frame  rate.  Specifically,  the  vision  system 
returns the coordinates of the centroid, width,
and height of the blobs in the image. For the
looming  method,  these  quantities  are  tracked  as 
the  robot  moves  directly  toward  or  away  from 
the  object. 

When  the  robot’s  displacements  are  known 
through  internal  odometry,  the  object’s  dis¬ 
tance  can  be  derived  by  comparing  as  few  as 
two frames, which on our system are processed
at a rate of 10 Hz. If the object is very far away
or the robot moves slowly, more frames
may be required. Figure 18B illustrates typical
performance:  the  robot  moves  back  and  forth  at 
a  velocity  of  15mm/sec,  starting  at  a  distance  of 
about  1,000mm  from  an  object.  Once  the  robot 
has  moved  about  50mm,  the  distance  estimate 
is  already  within  about  20mm  of  the  actual  dis¬ 
tance.  One  goal  is  to  combine  this  technique 
with  the  Gaudiano  et  al.  [1996]  model  of  unsu¬ 
pervised  robot  control  so  that  the  robot’s  odom¬ 
etry  can  be  learned,  rather  than  having  to  be 
obtained  through  external  calibration.  Another 
goal  is  to  cast  the  same  algorithm  in  the  form 
of  an  adaptive  neural  network,  whereby  learning 
of  the  relationship  between  looming  and  depth 
can  be  carried  out  in  a  totally  unsupervised,  au¬ 
tonomous  fashion. 
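
The two-frame computation can be made concrete with a short sketch. It
assumes an idealized pinhole camera moving along the line of sight, so that
the apparent size $s$ of the object is inversely proportional to its
distance $d$, that is, $s = k/d$ for some unknown constant $k$. If the robot
advances by $\Delta$ between frames, then $s_1 (d_2 + \Delta) = s_2 d_2$,
which gives $d_2 = \Delta\, s_1 / (s_2 - s_1)$. The constant $k$ cancels,
which is why no camera calibration is needed. The function name and the
numbers below are illustrative only.

```python
def looming_distance(size_before, size_after, displacement):
    """Estimate the distance to an object at the second of two frames.

    size_before, size_after: apparent size (e.g., blob width in pixels).
    displacement: distance travelled between the frames from odometry,
    positive when moving toward the object, negative when moving away.
    Assumes apparent size is inversely proportional to distance.
    """
    growth = size_after - size_before
    if growth == 0:
        raise ValueError("no change in apparent size; more motion is needed")
    return displacement * size_before / growth

# Synthetic check: an object whose apparent size is k / d, with k unknown
# to the estimator. The robot starts 1000 mm away and advances 50 mm.
k = 40000.0
s1, s2 = k / 1000.0, k / 950.0
approach = looming_distance(s1, s2, 50.0)     # distance after advancing
retreat = looming_distance(s2, s1, -50.0)     # distance after backing up
print(approach, retreat)
```

When the object is far away or the displacement is small, the change in
apparent size approaches the measurement resolution and the estimate becomes
noisy, which is consistent with the need for additional frames in those
cases.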

10.2  A  Neural  Model  of  Attentive 
Vision 

P. Gaudiano and A. Harner are developing a
neural network model of attentive visual search,
intended to speed the selective processing of
complex scenes. The project adapts the Spatial
Object Search (SOS) model of Grossberg, Mingolla,
and Ross [1994]. It consists of two stages:
a  pre-attentive  processing  stage  that  operates 
locally  and  in  parallel  over  a  low-resolution  im¬ 
age  of  the  visual  field,  followed  by  attentive  pro¬ 
cessing  that  operates  on  small,  high-resolution 
windows  in  series. 

Figure 18: (A): The Pioneer 1 mobile robot. (B): Real distance (solid) and distance estimated
with the looming method (dashed).

The first stage forms a number of low-resolution
feature maps by convolving the image with
a set of feature filters. Because the model aims
at machine vision applications, it uses a
fast oriented pyramid. The low-resolution
feature maps are then gated by a feature vector
representing the total amount of each feature
stored in long-term memory for a given target.
The gated feature maps are then combined to
form a composite gated feature map. After this
map is smoothed, its peaks mark
regions of interest to be processed in more detail
by the second stage of the model. The second
stage positions a high-resolution, mobile window
over these peaks. This window captures
the role of attention in selecting and positioning
the fovea over areas of interest. A template
match is then performed on images within the
high-resolution window using a network similar
to ART2a, which learns to recognize the target
features.
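
The pre-attentive stage described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the feature maps are taken as already computed (the oriented pyramid is not reproduced), the 3 x 3 binomial smoothing kernel and the function names gated_composite_map, smooth, and regions_of_interest are hypothetical choices, and regions of interest are taken to be strict local maxima of the smoothed map.

```python
def gated_composite_map(feature_maps, target_vector):
    """Weight each low-resolution feature map by the stored target
    feature vector and sum the gated maps into one composite map."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    composite = [[0.0] * cols for _ in range(rows)]
    for fmap, gate in zip(feature_maps, target_vector):
        for r in range(rows):
            for c in range(cols):
                composite[r][c] += gate * fmap[r][c]
    return composite


def smooth(image):
    """Smooth with a 3x3 binomial kernel (an assumed choice), treating
    pixels outside the image as zero."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        weight = (2 - abs(dr)) * (2 - abs(dc)) / 16.0
                        acc += weight * image[rr][cc]
            out[r][c] = acc
    return out


def regions_of_interest(image, max_regions):
    """Return up to max_regions strict local maxima as (row, col),
    strongest first; these are where the attentive window would go."""
    rows, cols = len(image), len(image[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = image[r][c]
            neighbours = [
                image[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            if all(value > n for n in neighbours):
                peaks.append((value, r, c))
    peaks.sort(reverse=True)
    return [(r, c) for _, r, c in peaks[:max_regions]]
```

In this sketch a location that carries every target feature produces a higher composite peak than a distractor sharing only some of them, so the attentive window would visit it first.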

The  model  is  robust  and  fast  when  working  with 
large  noisy  images.  It  is  able  to  find  targets  con¬ 
sisting  of  a  conjunction  of  several  features  in  a 
scene  with  multiple  distractors  sharing  some  of 
those  features.  When  a  significant  amount  of 
noise  is  added  to  the  system,  it  still  finds  the 
target  (as  shown  in  Figure  19(A)).  The  model’s 
response  time  resembles  the  reaction  time  (ver¬ 
sus  number  of  distractors)  curves  seen  in  psy¬ 
chophysical  experiments.  When  the  target  and 
distractors  do  not  share  many  features,  it  mim¬ 
ics  the  “pop-out”  effect  by  finding  the  target 
quickly  and  independently  of  the  number  of  dis¬ 
tractors.  In  addition,  the  model  is  able  to  find 
and  recognize  textured  objects  in  textured  back¬ 
grounds  to  a  limited  degree  without  performing 
a  pre-attentive  segmentation  process  (as  shown 
in Figure 19(B)). The system takes only about
8 seconds to train and 10 seconds to test on
256 × 256 images in MATLAB running on a Sun
Sparc 10 workstation.

11  Neuromorphic  VLSI  for  Battle 
Awareness 

Despite many years of research and remarkable
advances in the field, robust automatic object
recognition by truly autonomous, highly mobile,
and portable hardware systems is still an open
problem [U.S. Army Research Office, 1994].
Biological organisms excel at solving problems in sensory
communication — audition  and  vision — and  mo¬ 
tor  control,  by  sustaining  high  computational 
throughput  with  minimal  energy  consumption 
and  heat  production.  Biological  systems  are 
highly  mobile  and  thus  constrained  by  size, 
weight,  and  the  availability  of  energy  resources. 
Thus  it  may  be  worthwhile  to  implement  biolog¬ 
ical  information  processing  principles  in  VLSI 
designs. 

Here we describe our effort in developing
sensory processing hardware for automated
sensing and vision systems at the Boston
University VLSI and Neural Net System Laboratory
in collaboration with Johns Hopkins’ Sensory
Communication and Analog VLSI Laboratories.
More specifically, we are developing
real-time computational engines for image
preprocessing, learning, and adaptive classification
based on the BCS/FCS architecture and the
ART family of learning networks, respectively.
As a demonstration vehicle and case-study problem,
we focus on a real-time visual processing
system to enhance Synthetic Aperture Radar
(SAR) sensor data for viewing by human observers.
The processed SAR image will be received
in a buffer for use by Intelligent Analysts
or automatic machine classifiers.

Figure 19: (A) Left: A test image with 300% noise. Right: The system finds the target (the white
cross) in one try and marks it with a black ‘X’. (B) Left: A textured test image. Right: Again, the
system finds the target.

Our work bridges three clearly established
disciplines: (1) neural and cognitive science, (2)
VLSI signal processing, and (3) computer
architecture. Engineers working on computer
architecture typically have little exposure to
neural systems. Similarly, researchers in
computational neuroscience typically have little
exposure to VLSI signal processing and computer
architecture. Most computer architecture work
today is done in companies driven by commodity
markets rather than by the need for highly
efficient systems for specialized problems; this
is likely to change now that design tools for
custom and semi-custom integrated circuit design
have reached maturity. The work reported here is
orthogonal to ongoing work on VLSI signal
processing and architectures for video processing
using specialized digital signal processor
architectures (VSPs).

In  the  remainder  of  this  section  we  discuss  in 
detail  our  work  on  VLSI  architectures  for  the 
BCS/FCS  and  for  the  ART  family  of  learning 
systems. 


11.1  VLSI  Architectures  for  the 
BCS/FCS  Model 

Work  in  this  subtask  of  the  project  has  focused 
on  the  following  basic  questions:  (1)  Can  we 
simplify  mathematical  functions  of  the  model 
to  map  on  parallel  analog  hardware  (resistive 
grids)  and  how  do  these  simplifications  affect 
performance?  (2)  What  is  the  appropriate  sig¬ 
nal  representation  and  coding  at  each  stage  of 
the  model  and  how  do  we  solve  the  global  inter¬ 
connect  problem? 

The  original  BCS/FCS  model  is  a  multi-stage, 
multi-scale  visual  processing  system  depicted 
in  Figure  20.  Currently,  our  simulations  have 
shown  that  a  reduced  BCS/FCS  model  using 
only  stages  1,  2,  3  and  9  would  speed  proto¬ 
type  development  while  yielding  good  results. 
It should be emphasized that when we refer to
“simplifications” throughout the discussion that
follows, these are made after extensive
simulations and analysis of the original models
and, more often than not, are used as
stepping stones for more complicated model
development.

11.2  Shunting  Neuron  and  Resistive 
Grids 

The first question, addressed and reported in
Hinck and Hubbard [1996], relates to the
implementation of shunting neurons and the
Gaussian kernels. Resistive grids [Andreou
et al., 1995; Boahen and Andreou, 1992;
Mead, 1989] are powerful computational
primitives that can yield Gaussian-like kernels. We
address these issues by focusing on the first
stage in the BCS/FCS architecture.

Figure 20: Block diagram of a single-scale
BCS/FCS model. Stages 2 through 8 are each
represented by a single block, but each
contains K sub-blocks. These stages contain spatial
filters that are rotated by an angle determined
by the number of orientations K. The prototype
will contain 4 orientations initially, although 8
orientations are anticipated as the prototype is
revised.

Our “shunting neuron” circuitry is based on
the Hodgkin-Huxley electrical equivalent model
of the cell. The model is composed of three
branches: a resting potential branch, an excitatory
branch, and an inhibitory branch. In the original
model, the excitatory and inhibitory branches
were variable conductances whose magnitudes are
computed using a Gaussian convolution operator.
The variable conductances were implemented
using modified four-quadrant voltage-to-current
multipliers where one input was used as a gain
factor and the other as a differential voltage.
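
For a circuit with three parallel conductance branches of this kind, the steady-state node voltage follows from Kirchhoff's current law as a conductance-weighted average of the branch potentials. The sketch below illustrates that relation only; the function name and the example values are hypothetical and are not taken from the circuit described here.

```python
def membrane_voltage(g_rest, e_rest, g_exc, e_exc, g_inh, e_inh):
    """Steady-state voltage of a node fed by three parallel branches,
    each a conductance in series with a reversal potential.

    Setting the summed branch currents g * (E - V) to zero gives the
    conductance-weighted average of the three potentials.
    """
    total = g_rest + g_exc + g_inh
    if total <= 0.0:
        raise ValueError("at least one conductance must be positive")
    return (g_rest * e_rest + g_exc * e_exc + g_inh * e_inh) / total


# Hypothetical values: the voltage stays between the inhibitory and
# excitatory potentials however large the input conductances become,
# which is the saturating ("shunting") behaviour.
v_weak = membrane_voltage(1.0, 0.0, 0.1, 1.0, 0.1, -1.0)
v_strong = membrane_voltage(1.0, 0.0, 1000.0, 1.0, 0.1, -1.0)
```

Because the inputs appear in both numerator and denominator, increasing the excitatory conductance drives the voltage toward the excitatory potential without ever exceeding it.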

The final shunting neuron was realized by
eliminating the resting potential branch and
transforming the circuit into a pull-up pull-down
structure. Our simulation results from an array of
“shunting neurons” provide evidence that our
analog resistive-grid-based circuitry works as a
close approximation to the original BCS/FCS
specification.

The comparison between Gaussian kernels and
resistive-grid-based kernels was done by
simulating Stage 1 of the BCS/FCS architecture. A
wide-dynamic-range input function, as depicted
in Figure 21, was used to test the circuit’s dynamic
compression and edge enhancement abilities. As a
benchmark, we built two mathematical MATLAB
models based on the original BCS/FCS
with either a Gaussian or an exponential filter
(resistive-grid-based kernel).

Figure 21: Comparison between Gaussian kernels
and resistive grid based processing.

Comparing the BCS/FCS (Exponential, MATLAB)
model output with our circuit output
shows they are the same for larger inputs and
slightly different for smaller inputs. The
difference for smaller inputs can be attributed to
differing parameter values, because some
of the circuit’s parameter values are set
inherently by the circuit itself. In comparing
the BCS/FCS (Gaussian, MATLAB) model
with the other two models, the Gaussian output
looks much smoother. Besides comparing the
circuit’s output to the other models, we
investigated the circuit’s ability to compress a large
dynamic range and provide edge enhancement.
The circuit can compress four orders of
magnitude of input intensity, normalize the results
to within 1 volt, and provide edge enhancement
over all four orders of magnitude.
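
The kind of comparison described above can be illustrated with a one-dimensional sketch. It assumes the standard steady-state form of a shunting on-center off-surround stage, x = (B c - D s) / (A + c + s), where c and s are the input filtered by a narrow center kernel and a wide surround kernel. The kernel widths, the constants A, B, D, and the staircase input are hypothetical and are not the parameters of the MATLAB models or of the circuit.

```python
import math


def kernel(shape, width, radius):
    """Normalized 1-D kernel: Gaussian, or exponential as produced by
    a resistive grid."""
    offsets = range(-radius, radius + 1)
    if shape == "gaussian":
        weights = [math.exp(-(d * d) / (2.0 * width * width)) for d in offsets]
    elif shape == "exponential":
        weights = [math.exp(-abs(d) / width) for d in offsets]
    else:
        raise ValueError("unknown kernel shape: " + shape)
    total = sum(weights)
    return [w / total for w in weights]


def convolve(signal, weights):
    """Filter the signal, replicating the border samples."""
    radius = len(weights) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            j = min(max(i + k - radius, 0), last)
            acc += w * signal[j]
        out.append(acc)
    return out


def shunting_stage(signal, shape, a=1.0, b=1.0, d=1.0,
                   center_width=1.0, surround_width=4.0, radius=12):
    """Steady state of a shunting on-center off-surround stage."""
    center = convolve(signal, kernel(shape, center_width, radius))
    surround = convolve(signal, kernel(shape, surround_width, radius))
    return [(b * c - d * s) / (a + c + s) for c, s in zip(center, surround)]


# Staircase input spanning four orders of magnitude.
staircase = []
for level in (1.0, 10.0, 100.0, 1000.0, 10000.0):
    staircase.extend([level] * 40)

gaussian_out = shunting_stage(staircase, "gaussian")
exponential_out = shunting_stage(staircase, "exponential")
```

With B = D the response is close to zero inside each plateau and swings positive and negative on either side of each step, while staying inside (-D, B) for every input level; this is the compression and edge enhancement behaviour the text refers to.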

11.3  Simplified  BCS  VLSI 
Architecture 

We have explored implementation issues
through the simplified architecture, shown
schematically in Figure 22, which partitions
the BCS model into three levels: simple cells,
complex and hypercomplex cells, and bipole
cells [Waskiewicz and Cauwenberghs, 1997].
Complex cells compute unidirectional gradients
of normalized intensity obtained from the
simple cells. Hypercomplex cells perform
spatial and directional competition (inhibition)
for edge formation. Bipole cells perform
long-range cooperation for edge enhancement,
and exert positive feedback (excitation) onto
the hypercomplex cells. Our present implementation
does not include the FCS model, which
completes and fills features through diffusive
spatial filtering of the image gated by the edges
formed in the BCS.

Figure 22: BCS/FCS model of image segmentation,
feature filling, and surface reconstruction.
Three layers are shown, representing simple,
complex and bipole cells.

Figure 23: Hexagonal arrangement of BCS
pixels, at the level of simple and complex cells,
extending in three directions x, y and z in the
focal plane.

We adopted the BCS algorithm for analog
continuous-time implementation on a hexagonal
grid, extending in three directions x, y and z
on the focal plane as indicated schematically in
Figure 23. For notational convenience, let
subscript 0 denote the center pixel and ±x, ±y and
±z its six neighbors. Components of each complex
cell “vector” $C_i$, along three directions of
edge selectivity, are indicated with superscript
indices x, y and z.

To facilitate testing, each basic cell comprises
a photosensor sourcing a current indicating
light intensity, a normalizing diffusion network
for the intensity currents, gradient computation
nodes, and one pseudo-complex cell and bipole
cell for each of the three directions.

The photosensors generate a current $I_i$ which
is normalized through a diffusive network
[Andreou et al., 1995]. Through current mirrors,
the currents $I_i$ propagate in the three
directions x, y, and z as noted in Figure 23.
Rectified finite-difference gradient estimates of
$I_i$ are obtained for each of the three hexagonal
directions. These gradients excite the complex cells
$C_i^j$.

Lateral inhibition among spatially (i) and
directionally (j) neighboring complex cells
implements the function of hypercomplex cells for
edge sharpening. The complex output ($C_i^j$) is
inhibited by local complex cell outputs in the
two competing directions of j. $C_i^j$ is
additionally inhibited by complex cells of the four
nearest neighbors in competing locations i with
parallel orientation.

A directionally selective interconnected
diffusive network of bipole cells $B_i^j$ provides long
range cooperative feedback, and strengthens
edges and curves while reducing false edges.
$C_i^j$ is excited by bipole interaction received from
the bipole cell $B_i^j$ on the line crossing i in the
same direction j.

The equations describing the operation of the
(hyper-)complex cells in our circuit
implementation are as follows:

$$C_0^x = |I_z + I_y - I_{-y} - I_{-z}| - \alpha (C_0^y + C_0^z) - \alpha' (C_z^x + C_{-z}^x + C_y^x + C_{-y}^x) + \beta B_0^x \qquad (9)$$

where:

1. $|I_z + I_y - I_{-y} - I_{-z}|$ is the rectified
gradient input from the complex cells;

2. $\alpha (C_0^y + C_0^z)$ is the inhibition by local
opposing directions;

3. $\alpha' (C_z^x + C_{-z}^x + C_y^x + C_{-y}^x)$ is
inhibition from non-aligned neighbors in the same
direction; and

4. $\beta B_0^x$ is the excitation of long-range
cooperation by the bipole cell.

The constants $\alpha$, $\alpha'$ and $\beta$ are set independently
by externally applied bias voltages.
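
As a worked illustration, the (hyper-)complex cell equation for the x direction can be evaluated for a single center pixel. The sketch assumes the form $C_0^x = |I_z + I_y - I_{-y} - I_{-z}| - \alpha (C_0^y + C_0^z) - \alpha' (C_z^x + C_{-z}^x + C_y^x + C_{-y}^x) + \beta B_0^x$; the function name, the dictionary layout, and all numeric values are hypothetical, and the chip computes this in continuous time with currents rather than in software.

```python
def complex_cell_x(intensity, c_local, c_x_neighbours, b_local_x,
                   alpha, alpha_prime, beta):
    """Evaluate the x-direction complex cell output at the center pixel.

    intensity      -- normalized intensities at neighbours 'y', '-y', 'z', '-z'
    c_local        -- center-pixel complex outputs in the competing
                      directions 'y' and 'z'
    c_x_neighbours -- x-direction complex outputs at the four non-aligned
                      neighbours 'z', '-z', 'y', '-y'
    b_local_x      -- x-direction bipole feedback at the center pixel
    """
    gradient = abs(intensity["z"] + intensity["y"]
                   - intensity["-y"] - intensity["-z"])
    local_inhibition = alpha * (c_local["y"] + c_local["z"])
    neighbour_inhibition = alpha_prime * sum(
        c_x_neighbours[k] for k in ("z", "-z", "y", "-y"))
    return (gradient - local_inhibition - neighbour_inhibition
            + beta * b_local_x)


# Hypothetical values for one pixel.
example = complex_cell_x(
    intensity={"y": 3.0, "-y": 1.0, "z": 2.0, "-z": 0.5},
    c_local={"y": 0.5, "z": 0.25},
    c_x_neighbours={"z": 0.1, "-z": 0.2, "y": 0.3, "-y": 0.4},
    b_local_x=1.0,
    alpha=0.4, alpha_prime=0.2, beta=0.5,
)
```

Here the gradient term contributes 3.5, local and neighbor inhibition remove 0.3 and 0.2, and bipole feedback adds 0.5, so the three bias-controlled constants trade edge sharpening against long-range reinforcement.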

The bipole cell resistive grid implements a
three-fold directionally polarized long-range
diffusive kernel, basically as follows:

$$\ldots + K_y C^y + K_z C^z \qquad (10)$$

where $K_x$, $K_y$, and $K_z$ represent identical but
rotated bipole kernels with polarization in the
x, y and z directions. The kernels are
implemented by three linear networks of diffusor
elements [Andreou et al., 1995; Boahen and
Andreou, 1992] complemented with cross-links of
adjustable strength to control the degree of
direction selectivity, besides the spatial spread of
the kernel. Finally, the result (10) is locally
normalized using current-mode circuitry, before it
is fed back onto the complex cells.

The simplified circuit diagram of the BCS cell,
including complex and bipole cell functions on
a hexagonal grid, is shown in Figure 24. The
complex cell portion in Figure 24A combines
intensities $I_y$, $I_{-y}$, $I_z$, and $I_{-z}$ received from
neighboring cells to compute the rectified
gradient in (9), using standard current mirrors and
an absolute value circuit. A pMOS load
converts the complex cell output into a voltage
representation $C_0^x$ for distribution to neighboring
nodes and complementary orientations: local
inhibition for spatial and directional competition
in Figure 24B, and long-range cooperation
through the bipole layer in Figure 24C.

Voltage biases control the spatial extent and
directional selectivity of the interactions, as well
as the relative strength of inhibition and
excitation, and the level of renormalization, for the
complex and bipole cells. The values for $g_0$, $g_1$
and $g_2$ controlling the bipole kernel are set
externally by applying gate bias voltages $V_{g0}$, $V_{g1}$
and $V_{g2}$, respectively. Likewise, the constants
$\alpha$, $\alpha'$ and $\beta$ in (9) are set independently by the
applied source voltages $V_{s1}$, $V_{s2}$ and $V_{s3}$.
Normalization of the bipole response for improved
stability of edge formation is achieved by
modulation through an additional diffusive network
that acts as a localized Gilbert-type current
normalizer (not shown in Figure 24).

Figure 24: Simplified circuit schematic of one
BCS cell in the hexagonal array. (A) Complex
cell rectified gradient calculation. (B)
Hypercomplex cell spatial and orientational
inhibition. (C) Bipole cell directional long range
cooperation.

A  prototype  12  x  12  pixel  array  has  been  fabri¬ 
cated  through  MOSIS.  The  prototype  pixel  unit 
has  been  designed  for  optimal  testability,  and 
has  not  been  optimized  for  density.  The  pixel 
contains  86  transistors  including  a  phototran¬ 
sistor  (for  evaluation  purposes),  a  large  pMOS 
sample-and-hold  capacitor,  and  three  networks 
of  interconnections  in  each  of  the  three  direc¬ 
tions,  requiring  a  fan-in/fan-out  of  18  currents 
at  the  interface  of  each  pixel  unit.  Testing  has 
already  provided  key  insights  on  further  im¬ 
provements  in  the  architecture. 

We  expect  this  BCS  chip  to  offer  an  impor¬ 
tant  tool  to  further  understand  the  operation 
and  dynamics  of  BCS  and  related  algorithms. 
In  future  research,  we  plan  to  extend  the  de¬ 
sign  to  incorporate  the  Feature  Contour  System 
(FCS), and scale up the dimension of the array
to an appropriate size for smart vision and
pattern recognition systems, using state-of-the-art
CMOS technology. Based on the current design,
a 10,000-pixel array in 0.5 µm CMOS technology
would fit a 1 cm² die.

11.4  Chip  to  Chip  Communication 

Ultimately, the prototype of the BCS/FCS will
run interactively with a computer that handles
loading/unloading of the special chips and computes
stages for which chips are not currently
developed. Currently, our raw SAR data is contained
in flat files and is converted into corresponding
analog data and loaded into Stage 1.
Each chip will be serviced by asynchronous spatial
oversampling data converters that unload the
chip and send the data to the next location. In
the case of rotating spatial filters, we have the
choice of either rotating the filters electronically
or rotating the data. For the prototype, the latter
seems preferable. Locations can be encoded in
a linear or nonlinear mapping that will be
assigned with Read Only Memory (ROM) chips.
Preliminary work in this direction is both
exciting and promising.

11.5  VLSI  Architectures  for  the 
ART  Family  of  Learning 
Machines 

In this subtask of the project we focus on
intelligent memories based on the ART family of
learning machines and on leveraging
state-of-the-art fabrication technologies for digital,
analog, and multilevel floating-gate storage.

The general chip architecture for memory-based
processors is shown in Figure 25. This is a
general floorplan applicable to ART1, fuzzy ART,
and ARTMAP network models. The central
array contains a very dense matrix of the synaptic
cells. These can be analog with continuous
updates [Cohen, Abshire, and Cauwenberghs,
1997] or digital [Pouliquen, Andreou, and
Strohbehn, 1997; Serrano-Gotarredona and
Linares-Barranco, 1997]. The bottom cells are
responsible for providing the appropriate biasing to
the synaptic cells. This biasing depends on
the input pattern. The function of the top cells is
to implement the learning rule. Every time a
row wins the competition and is selected for
recode, its weight vector is copied to the top
cells. This vector is then updated taking into
account the actual input pattern, and stored back
into the winning row. The right cells evaluate
the vigilance criterion and perform the rest of the
computations needed to obtain the choice functions
for the Winner-Take-All. The Winner-Take-All
selects the maximum, and its address, properly
encoded, is sent to the outside of the chip.

Figure 25: VLSI floorplan for ART network
processors.
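
The data flow of this floorplan can be sketched in software for the ART1 case. The sketch uses the textbook fast-learning ART1 rules (choice function |I AND w| / (beta + |w|), vigilance test |I AND w| / |I| >= rho, update w <- I AND w) as an assumption; the chips cited above may implement a different, hardware-oriented choice function, and the name art1_present is hypothetical.

```python
def art1_present(weights, pattern, vigilance, beta=1.0):
    """Present one binary pattern to an ART1 memory.

    weights   -- list of binary rows, one per F2 node (the synaptic array)
    pattern   -- binary input vector over the F1 nodes
    vigilance -- match threshold rho in [0, 1]
    Returns the index of the row that was recoded, or None if every row
    failed the vigilance test.
    """
    input_norm = sum(pattern)
    # "Right cells": overlap and choice function for every row.
    overlaps = [sum(w & p for w, p in zip(row, pattern)) for row in weights]
    choices = [ov / (beta + sum(row)) for ov, row in zip(overlaps, weights)]
    # Winner-Take-All with reset: try rows in order of decreasing choice.
    order = sorted(range(len(weights)), key=lambda j: choices[j], reverse=True)
    for j in order:
        if input_norm and overlaps[j] / input_norm >= vigilance:
            # "Top cells": update the winning row and store it back.
            weights[j] = [w & p for w, p in zip(weights[j], pattern)]
            return j
    return None


# Three uncommitted rows (all ones) over six F1 nodes.
memory = [[1] * 6 for _ in range(3)]
first = art1_present(memory, [1, 1, 1, 0, 0, 0], vigilance=0.8)
second = art1_present(memory, [0, 0, 0, 1, 1, 1], vigilance=0.8)
```

Each call corresponds to one pass through the floorplan: the input biases the array, the right-hand cells score every row, the Winner-Take-All picks a row, and the learning cells rewrite it.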

11.6  Multi-Chip  ART1  Systems

Crucial in any large-scale deployment of
memory-based processors is a scalable design where
each individual chip has appropriate hooks to
interface to similar chips in a multi-chip module.
We are exploring multi-chip architecture
issues using the ART1 memory-based processor
chip developed at the National Microelectronics
Center in Sevilla, Spain [Serrano-Gotarredona
and Linares-Barranco, 1997]. This chip is an
improved version of the first ART1 prototype
reported by Serrano-Gotarredona and
Linares-Barranco [1996]; it improves manufacturing
yield from 6% to 98% and corrects some minor
problems encountered in the first design. This chip
implements an ART1 network with a 50-node
F1 layer and a 10-node F2 layer. Ten chips are
available. These chips can be assembled in an
N by M array so that an ART1 with N×50 F1
nodes and M×10 F2 nodes can be assembled.
Some preliminary multichip prototypes have
already been tested that implement both ART1
and ARTMAP systems [Serrano-Gotarredona
and Linares-Barranco, 1997]. Both Mrs. Teresa
Serrano-Gotarredona and Dr. Bernabe
Linares-Barranco are visiting Johns Hopkins during the
1996-1997 academic year on a Fulbright
scholarship and postdoctoral fellowship, respectively.

One major problem when increasing the number
of nodes in the F2 layer is preserving the
precision of the Winner-Take-All circuit when it is
split among several chips. A Winner-Take-All
circuit requires a certain degree of precision so
that all of its inputs receive the same treatment.
This is difficult to achieve in practice with high
precision, because the technological parameters
of transistors vary widely from chip to chip, so
the Winner-Take-All inputs of one chip would
receive a different treatment than the
Winner-Take-All inputs of another chip. The present
prototype includes no special Winner-Take-All
design to overcome this problem. However,
some efforts toward this goal have already been
considered [Serrano-Gotarredona and
Linares-Barranco, 1997]. The idea is to reformulate
the Winner-Take-All (WTA) operation in the
current domain. This makes it possible to replicate
and compare currents locally on one chip and to
transport currents from chip to chip. As long as
current replication and comparison remain
precise locally on each chip, overall multichip WTA
precision will be preserved.
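
The argument can be illustrated numerically. In the sketch below each chip finds its local maximum, and only those local maxima are compared across chips; a per-chip gain factor stands in for chip-to-chip parameter variation. The function names, the gain model, and the numbers are hypothetical and serve only to show why mismatch between chips can change the winner.

```python
def local_winner(currents):
    """Winner-Take-All on one chip: index and value of the largest input."""
    best = max(range(len(currents)), key=currents.__getitem__)
    return best, currents[best]


def multichip_winner(chips):
    """Compare the local winners of several chips.

    chips -- list of per-chip lists of input currents
    Returns (chip_index, node_index) of the global winner.
    """
    winner = None
    for chip_index, currents in enumerate(chips):
        node_index, value = local_winner(currents)
        if winner is None or value > winner[2]:
            winner = (chip_index, node_index, value)
    return winner[0], winner[1]


def with_gain_mismatch(chips, gains):
    """Scale every input of a chip by that chip's gain (hypothetical
    model of chip-to-chip parameter variation)."""
    return [[g * c for c in currents] for currents, g in zip(chips, gains)]


inputs = [[1.0, 3.0], [2.9, 0.5]]
ideal = multichip_winner(inputs)
mismatched = multichip_winner(with_gain_mismatch(inputs, [1.0, 1.1]))
```

With identical treatment on both chips the second node of the first chip wins; a 10% gain difference is enough to hand the win to the other chip, which is the failure mode that precise current replication between chips is meant to avoid.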

We have designed a PCI-based co-processor
board and software under Linux and X Windows.
The board has multiple ART1 chips
so that ART1 systems of N×50 F1 nodes and
M×10 F2 nodes (with N + M ≤ 10) can be
realized. Based on the experience of the Sevilla
group and our experience with intelligent memory
designs, we are architecting the next generation
of ART family chips; chips with digital storage
and analog processing are also being prototyped
in a state-of-the-art CMOS process.